⚠️ 以下所有内容总结均由大语言模型生成,如有错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace 免费体验
2025-10-21 更新
PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold
Authors:Yi Wan, Jiuqi Wang, Liam Li, Jinsong Liu, Ruihao Zhu, Zheqing Zhu
Tool-augmented large language models (LLMs) are emerging as deep research agents, systems that decompose complex queries, retrieve external evidence, and synthesize grounded responses. Yet current agents remain limited by shallow retrieval, weak alignment metrics, and brittle tool-use behavior. We introduce PokeeResearch-7B, a 7B-parameter deep research agent built under a unified reinforcement learning framework for robustness, alignment, and scalability. PokeeResearch-7B is trained by an annotation-free Reinforcement Learning from AI Feedback (RLAIF) framework to optimize policies using LLM-based reward signals that capture factual accuracy, citation faithfulness, and instruction adherence. A chain-of-thought-driven multi-call reasoning scaffold further enhances robustness through self-verification and adaptive recovery from tool failures. Among 10 popular deep research benchmarks, PokeeResearch-7B achieves state-of-the-art performance among 7B-scale deep research agents. This highlights that careful reinforcement learning and reasoning design can produce efficient, resilient, and research-grade AI agents. The model and inference code is open-sourced under MIT license at https://github.com/Pokee-AI/PokeeResearchOSS.
工具增强型大型语言模型(LLM)正逐渐发展为深度研究代理,这些系统能够分解复杂查询、检索外部证据并合成基于事实的回答。然而,当前代理仍受到浅层检索、弱对齐指标和工具使用行为不稳定等限制。我们推出PokeeResearch-7B,这是一个在统一强化学习框架下构建的7B参数深度研究代理,旨在实现稳健性、对齐性和可扩展性。PokeeResearch-7B通过无标注的"基于AI反馈的强化学习"(RLAIF)框架进行训练,使用基于LLM的奖励信号来优化策略,以捕捉事实准确性、引用忠诚度和指令遵循度。思维链驱动的多次调用推理框架通过自我验证和从工具故障中自适应恢复,进一步增强了稳健性。在10个流行的深度研究基准测试中,PokeeResearch-7B在7B规模的深度研究代理中实现了最先进的性能。这表明,精心设计的强化学习与推理机制可以产生高效、稳健且达到研究级水准的AI代理。模型和推理代码以MIT许可证在https://github.com/Pokee-AI/PokeeResearchOSS开源。
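下面给出一个极简的示意代码,展示"无标注 RLAIF 奖励"的一种可能组合方式:由评审 LLM 分别对事实准确性、引用忠诚度和指令遵循度打分,再加权合成标量奖励。其中 `judge_llm`、提示词模板与权重均为假设的占位,并非论文的真实实现。

```python
# 最简示意:用评审 LLM 对三个维度打分并加权合成 RLAIF 标量奖励(非原论文实现)

def judge_llm(prompt: str) -> float:
    """占位函数:假设调用某个评审 LLM 并返回 0~1 的分数,此处仅返回常数以便运行。"""
    return 0.5

def rlaif_reward(question: str, answer: str, evidence: list[str],
                 weights=(0.5, 0.3, 0.2)) -> float:
    prompts = [
        f"问题:{question}\n回答:{answer}\n请给出事实准确性评分(0~1):",
        f"证据:{evidence}\n回答:{answer}\n请评估引用是否忠实于证据(0~1):",
        f"问题:{question}\n回答:{answer}\n请评估是否遵循了指令(0~1):",
    ]
    scores = [judge_llm(p) for p in prompts]          # 三个维度各打一次分
    return sum(w * s for w, s in zip(weights, scores))  # 加权合成标量奖励

print(rlaif_reward("谁提出了相对论?", "爱因斯坦 [1]", ["[1] 维基百科:阿尔伯特·爱因斯坦"]))
```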
论文及项目相关链接
Summary
大型语言模型(LLM)正在成为深度研究代理工具,它们能够分解复杂查询、检索外部证据并合成有根据的回应。然而,当前代理工具仍存在浅层检索、弱对齐指标和工具使用行为脆弱等限制。我们推出了PokeeResearch-7B,这是一个在统一强化学习框架下构建的7B参数深度研究代理工具,旨在实现稳健性、对齐性和可扩展性。PokeeResearch-7B通过无标注的强化学习从人工智能反馈(RLAIF)框架进行优化,使用LLM奖励信号来捕捉事实准确性、引用忠诚度和指令依从性,以优化策略。思维驱动的多重调用推理架构通过自我验证和自适应恢复工具故障,进一步增强了稳健性。在10个流行的深度研究基准测试中,PokeeResearch-7B在7B规模深度研究代理中实现了最先进的性能。这突显了精心设计的强化学习和推理可以产生高效、坚韧和研究级的人工智能代理。模型和推理代码已在MIT许可下开源。
Key Takeaways
- 大型语言模型(LLM)正成为深度研究代理工具,能够处理复杂查询、检索证据并合成回应。
- 当前LLM存在浅层检索、弱对齐和工具使用脆弱等限制。
- PokeeResearch-7B是一个在强化学习框架下构建的深度研究代理,旨在提高稳健性、对齐性和可扩展性。
- PokeeResearch-7B采用无标注强化学习从人工智能反馈(RLAIF)框架,优化LLM奖励信号以捕捉事实准确性、引用忠诚度和指令依从性。
- 思维驱动的多重调用推理架构增强了PokeeResearch-7B的稳健性,通过自我验证和自适应恢复工具故障。
- 在多个基准测试中,PokeeResearch-7B达到了先进性能水平,突显了强化学习和精心设计推理的重要性。
点此查看论文截图
InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training
Authors:Pengkai Wang, Qi Zuo, Pengwei Liu, Zhijie Sang, Congkai Xie, Hongxia Yang
Large Language Models (LLMs) have shown substantial advances through reinforcement learning (RL), particularly in domains where rewards can be programmatically verified, such as mathematics and code. In these areas, models benefit from a well-defined operational base guided by explicit rule-based objectives. However, this progress reveals a significant limitation: in open-ended domains where rewards are ambiguous, subjective, or context-dependent, such as creative writing, scientific reasoning, and notably medical consultation, robust reward functions are lacking, making these areas challenging for current RL strategies. To bridge this gap, we introduce ORBIT, an open-ended rubric-based incremental training framework specifically designed for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with the dynamic creation of rubrics, employing these rubrics to direct an incremental RL process. In particular, this approach does not depend on external medical knowledge or manual rules, instead utilizing rubric-guided feedback to shape learning. When implemented on the Qwen3-4B-Instruct model, our method can greatly enhance its performance on the HealthBench-Hard benchmark from 7.0 to 27.2 using only 2k samples, thus achieving state-of-the-art results for models of this scale. Our analysis confirms that rubric-driven RL fosters consistent performance gains across diverse consultation scenarios, going beyond simple numerical improvements. These findings underscore rubric-based feedback as a scalable strategy for advancing LLMs in intricate, open-ended tasks.
大规模语言模型(LLM)通过强化学习(RL)取得了显著进展,特别是在奖励可以通过程序验证的领域,如数学和代码。在这些领域中,模型受益于由明确的基于规则的目标所引导的、定义良好的操作基础。然而,这种进展揭示了一个重大局限:在奖励模糊、主观或依赖于上下文的开放式领域,如创意写作、科学推理,尤其是医疗咨询,缺乏稳健的奖励函数,这使得这些领域对当前的RL策略构成挑战。为了弥补这一差距,我们引入了ORBIT,这是一个专门针对高风险医疗对话的开放式评分基准增量训练框架。ORBIT将合成对话生成与动态创建评分基准相结合,利用这些评分基准来指导增量RL过程。特别是,这种方法不依赖外部医学知识或人工规则,而是利用评分基准引导的反馈来塑造学习。当在Qwen3-4B-Instruct模型上实施时,我们的方法仅使用2k个样本就能将其在HealthBench-Hard基准测试上的性能从7.0提高到27.2,从而为这一规模的模型达到了最先进的成果。我们的分析证实,评分驱动RL在多种咨询场景中促进了性能的稳定提升,超越了简单的数字改进。这些发现表明,基于评分基准的反馈是推动大规模语言模型在开放式的复杂任务中进步的可扩展策略。
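下面是一个基于上述描述的简化示意:把动态生成的评分基准表示为带权重的条目,并据此为一段医疗对话回复计算奖励。`judge_item` 的关键词匹配只是演示用的假设,实际应由评审模型按评分基准逐条判断。

```python
# 示意:基于评分基准(rubric)的奖励计算(简化假设,并非 ORBIT 的原始实现)
from dataclasses import dataclass, field

@dataclass
class RubricItem:
    description: str                               # 例如 "是否询问了症状持续时间"
    weight: float                                  # 在总分中的权重
    keywords: list = field(default_factory=list)   # 仅供演示的关键词,实际应交由评审模型判断

def judge_item(response: str, item: RubricItem) -> float:
    """占位:实际应由评审 LLM 判断满足程度(0~1),此处以关键词命中近似。"""
    return 1.0 if any(k in response for k in item.keywords) else 0.0

def rubric_reward(response: str, rubric: list) -> float:
    total = sum(i.weight for i in rubric) or 1.0
    return sum(i.weight * judge_item(response, i) for i in rubric) / total

rubric = [
    RubricItem("询问症状持续时间", 2.0, ["多久", "持续"]),
    RubricItem("给出明确的就医建议", 1.0, ["就诊", "就医"]),
]
print(rubric_reward("请问症状持续多久了?建议尽快就诊。", rubric))   # 1.0
```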
论文及项目相关链接
PDF 17 pages, 6 figures
Summary:大型语言模型(LLM)通过强化学习(RL)在奖励可程序验证的领域取得了显著进展,如数学和代码。然而,在奖励模糊、主观或依赖于上下文的开放领域,如创意写作、科学推理和医疗咨询等,当前的RL策略面临挑战。为了弥补这一差距,提出了ORBIT,一个专为高风险医疗对话设计的开放式评分基准增量训练框架。ORBIT通过合成对话生成与动态创建评分标准的集成,利用这些评分标准指导增量RL过程。在Qwen3-4B-Instruct模型上实施后,该方法可以在仅使用2k样本的情况下,将HealthBench-Hard基准测试的性能从7.0提高到27.2,从而在如此规模的模型中实现了业界领先的结果。分析表明,评分驱动RL在多种咨询场景中实现了性能的稳定提升。
Key Takeaways:
- 大型语言模型在强化学习领域取得显著进展,特别是在奖励可程序验证的领域,如数学和代码。
- 在开放领域,如创意写作、科学推理和医疗咨询等,由于奖励的模糊性,当前强化学习策略面临挑战。
- ORBIT是一个专为高风险医疗对话设计的开放式评分基准增量训练框架,旨在解决上述挑战。
- ORBIT通过合成对话生成与动态创建评分标准的集成,利用这些评分标准指导增量强化学习过程。
- 在Qwen3-4B-Instruct模型上实施ORBIT后,性能显著提高,实现了业界领先的结果。
- 评分驱动RL在多种咨询场景中表现出稳定的性能提升。
点此查看论文截图
BLIP3o-NEXT: Next Frontier of Native Image Generation
Authors:Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, Tianyi Zhou, Junnan Li, Silvio Savarese, Caiming Xiong, Ran Xu
We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demonstrating strong image generation and image editing capabilities. In developing the state-of-the-art native image generation model, we identify four key insights: (1) Most architectural choices yield comparable performance; an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) The successful application of reinforcement learning can further push the frontier of native image generation; (3) Image editing still remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and data engine; (4) Data quality and scale continue to be decisive factors that determine the upper bound of model performance. Building upon these insights, BLIP3o-NEXT leverages an Autoregressive + Diffusion architecture in which an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, whose hidden states are then used as conditioning signals for a diffusion model to generate high-fidelity images. This architecture integrates the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations of various text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.
我们推出BLIP3o-NEXT,这是BLIP3系列中完全开源的基础模型,进一步推动了原生图像生成的边界。BLIP3o-NEXT在单一架构内统一了文本到图像生成和图像编辑,展现出强大的图像生成与图像编辑能力。在开发最先进的原生图像生成模型时,我们总结出四个关键见解:首先,大多数架构选择都能产生相当的性能;一个架构只要能够高效扩展并支持快速推理,就可以被认为是有效的。其次,强化学习的成功应用可以进一步推动原生图像生成的边界。第三,图像编辑仍然是一项具有挑战性的任务,但通过后训练和数据引擎,指令遵循以及生成图像与参考图像之间的一致性可以得到显著增强。最后,数据的质量和规模仍然是决定模型性能上限的决定性因素。基于这些见解,BLIP3o-NEXT采用自回归+扩散架构:自回归模型首先根据多模态输入生成离散图像令牌,其隐藏状态随后用作扩散模型的条件信号,以生成高保真图像。该架构将自回归模型的推理能力和指令遵循能力与扩散模型的精细细节渲染能力相结合,达到了新的连贯性与逼真度水平。对各种文本到图像和图像编辑基准的广泛评估表明,BLIP3o-NEXT的性能优于现有模型。
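为说明"自回归模型的隐藏状态作为扩散模型条件信号"这一衔接方式,下面给出一个玩具规模的 PyTorch 草图;模块结构、维度以及用 GRU 代替自回归 Transformer 均为演示性假设,并非 BLIP3o-NEXT 的真实实现。

```python
import torch
import torch.nn as nn

# 示意:自回归 + 扩散的条件接口:AR 模型产出的隐藏状态作为去噪器的条件信号(玩具规模)

class ToyAR(nn.Module):
    def __init__(self, vocab=32, d=16):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.rnn = nn.GRU(d, d, batch_first=True)   # 用 GRU 代替真实的自回归 Transformer
        self.head = nn.Linear(d, vocab)             # 预测下一个离散图像令牌的 logits
    def forward(self, tokens):
        h, _ = self.rnn(self.emb(tokens))
        return self.head(h), h                      # 返回 logits 与隐藏状态

class ToyDenoiser(nn.Module):
    def __init__(self, d=16, img_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(img_dim + d, 64), nn.ReLU(), nn.Linear(64, img_dim))
    def forward(self, noisy_img, cond):
        # 把 AR 隐藏状态(此处简单取均值池化)拼接进去作为条件
        return self.net(torch.cat([noisy_img, cond.mean(dim=1)], dim=-1))

ar, denoiser = ToyAR(), ToyDenoiser()
tokens = torch.randint(0, 32, (2, 5))               # 假设:已由多模态输入得到的离散图像令牌
logits, hidden = ar(tokens)
noisy = torch.randn(2, 8)
print(denoiser(noisy, hidden).shape)                # torch.Size([2, 8])
```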
论文及项目相关链接
摘要
BLIP3o-NEXT是一个完全开源的基础模型,属于BLIP3系列,推动了原生图像生成的新边界。该模型在文本到图像生成和图像编辑方面表现出强大的能力。在开发最先进的原生图像生成模型时,我们获得了四个关键见解。首先,大多数架构选择都能产生相当的性能;其次,强化学习的成功应用可以进一步推动原生图像生成的前沿;第三,图像编辑仍然是一项具有挑战性的任务,但通过后训练和数据引擎,指令遵循和生成图像与参考图像之间的一致性可以得到显著增强;最后,数据的质量和规模仍然是决定模型性能上限的决定性因素。基于这些见解,BLIP3o-NEXT采用自回归+扩散架构:自回归模型首先根据多模态输入生成离散图像令牌,其隐藏状态随后用作扩散模型的条件信号来生成高质量图像。这一架构将自回归模型的推理能力和指令遵循能力与扩散模型的精细细节渲染能力相结合,实现了新的连贯性和逼真度水平。对多种文本到图像和图像编辑基准的广泛评估表明,BLIP3o-NEXT的性能优于现有模型。
关键见解
- BLIP3o-NEXT是一个开源的原生图像生成模型,融合了文本到图像生成和图像编辑功能。
- 模型展现出强大的图像生成和编辑能力,基于自回归+扩散架构。
- 大多数架构选择都能产生相当的性能,关键在于架构能否高效扩展并支持快速推理。
- 强化学习的应用进一步推动了原生图像生成的发展。
- 图像编辑仍具挑战性,但通过后训练和数据引擎,指令遵循以及生成图像与参考图像的一致性可显著提升。
- 数据质量和规模对模型性能起决定性作用。
- 模型在多种文本到图像和图像编辑基准测试中表现出卓越性能。
点此查看论文截图
Paper2Web: Let’s Make Your Paper Alive!
Authors:Yuhang Chen, Tianpeng Lv, Siyi Zhang, Yixiang Yin, Yao Wan, Philip S. Yu, Dongping Chen
Academic project websites can more effectively disseminate research when they clearly present core content and enable intuitive navigation and interaction. However, current approaches such as direct Large Language Model (LLM) generation, templates, or direct HTML conversion struggle to produce layout-aware, interactive sites, and a comprehensive evaluation suite for this task has been lacking. In this paper, we introduce Paper2Web, a benchmark dataset and multi-dimensional evaluation framework for assessing academic webpage generation. It incorporates rule-based metrics like Connectivity, Completeness and human-verified LLM-as-a-Judge (covering interactivity, aesthetics, and informativeness), and PaperQuiz, which measures paper-level knowledge retention. We further present PWAgent, an autonomous pipeline that converts scientific papers into interactive and multimedia-rich academic homepages. The agent iteratively refines both content and layout through MCP tools that enhance emphasis, balance, and presentation quality. Our experiments show that PWAgent consistently outperforms end-to-end baselines like template-based webpages and arXiv/alphaXiv versions by a large margin while maintaining low cost, achieving the Pareto-front in academic webpage generation.
学术项目网站在清晰呈现核心内容、提供直观导航和交互功能时,能更有效地传播研究。然而,当前的方法,如直接大型语言模型(LLM)生成、模板或直接HTML转换,很难产生布局意识强、交互性强的站点,并且针对此任务的全面评估套件一直缺失。在本文中,我们介绍了Paper2Web,这是一套用于评估学术网页生成的基准数据集和多维度评估框架。它结合了基于规则的指标,如连通性、完整性以及经过人工验证的LLM-as-a-Judge(涵盖交互性、美观性和信息性),还有PaperQuiz,用于衡量论文级别的知识保留情况。我们进一步推出了PWAgent,这是一个将科学论文转化为丰富多媒体的交互式学术主页的自主管道。该代理通过MCP工具迭代优化内容和布局,这些工具增强了重点、平衡和呈现质量。我们的实验表明,PWAgent在学术网页生成方面始终大幅超越了基于模板的网页和arXiv/alphaXiv版本等端到端的基准线,同时保持低成本,实现了帕累托最优前沿。
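下面用一个最小示例演示"连通性(Connectivity)"这类基于规则的指标的一种可能实现:把站点页面视为链接图,统计从首页可达的页面比例。实现细节(如只跟踪 .html 链接)均为演示性假设,并非原基准代码。

```python
# 示意:Connectivity 指标的最简实现,检查各页面能否从首页经站内链接到达(细节为假设)
from collections import deque
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.endswith(".html"):
                self.links.append(href)

def connectivity(pages: dict, index: str = "index.html") -> float:
    """pages: 文件名 -> HTML 文本。返回可从首页到达的页面比例。"""
    seen, queue = {index}, deque([index])
    while queue:
        parser = LinkCollector()
        parser.feed(pages.get(queue.popleft(), ""))
        for href in parser.links:
            if href in pages and href not in seen:
                seen.add(href)
                queue.append(href)
    return len(seen) / max(len(pages), 1)

site = {"index.html": '<a href="demo.html">demo</a>', "demo.html": "<p>ok</p>", "orphan.html": ""}
print(connectivity(site))   # 2/3:orphan.html 不可达
```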
论文及项目相关链接
PDF Under Review. Check https://github.com/YuhangChen1/Paper2All for the unified platform to streamline all academic presentation
Summary
研究指出,学术项目网站可以通过清晰呈现核心内容、实现直观导航和互动来更有效地传播研究。然而,现有的方法如直接大型语言模型(LLM)生成、模板或直接HTML转换,难以生成具有布局意识和互动性的网站,且缺乏对此任务的全面评估套件。为此,本文介绍了Paper2Web,这是一套用于评估学术网页生成的基准数据集和多维度评估框架。它包含了基于规则的指标,如连通性、完整性和经过人工验证的LLM法官(涵盖互动性、美观性和信息量),以及PaperQuiz,可衡量论文层面的知识保留情况。此外,本文还提出了PWAgent,这是一种将科学论文转化为互动性和多媒体丰富的学术主页的自主管道。该代理通过增强重点、平衡和呈现质量的MCP工具来不断改进内容和布局。实验表明,PWAgent在保持低成本的同时,大幅优于基于模板的网页和arXiv/alphaXiv版本等端到端的基线方案,在学术网页生成方面达到了帕累托前沿。
Key Takeaways
- 学术项目网站需要清晰呈现核心内容和直观导航与互动以更有效地传播研究。
- 当前学术网页生成方法面临生成布局意识和互动性网站的挑战。
- 缺乏全面评估学术网页生成的评估套件。
- Paper2Web是一套新的基准数据集和多维度评估框架,用于评估学术网页生成的效果。
- Paper2Web包含基于规则的指标和经过人工验证的LLM法官以及用于衡量论文层面知识保留情况的PaperQuiz。
- PWAgent是一种自主管道,能将科学论文转化为互动性和多媒体丰富的学术主页。
点此查看论文截图
Neuro-Symbolic Spatial Reasoning in Segmentation
Authors:Jiayi Lin, Jiabo Huang, Shaogang Gong
Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of categories, requiring generalization to unseen and unlabelled objects. Using vision-language models (VLMs) to correlate local image patches with potential unseen object categories suffers from a lack of understanding of spatial relations of objects in a scene. To solve this problem, we introduce neuro-symbolic (NeSy) spatial reasoning in OVSS. In contrast to contemporary VLM correlation-based approaches, we propose Relational Segmentor (RelateSeg) to impose explicit spatial relational constraints by first order logic (FOL) formulated in a neural network architecture. This is the first attempt to explore NeSy spatial reasoning in OVSS. Specifically, RelateSeg automatically extracts spatial relations, e.g., <cat, to-right-of, person>, and encodes them as first-order logic formulas using our proposed pseudo categories. Each pixel learns to predict both a semantic category (e.g., “cat”) and a spatial pseudo category (e.g., “right of person”) simultaneously, enforcing relational constraints (e.g., a “cat” pixel must lie to the right of a “person”). Finally, these logic constraints are formulated in a deep network architecture by fuzzy logic relaxation, enabling end-to-end learning of spatial-relationally consistent segmentation. RelateSeg achieves state-of-the-art performance in terms of average mIoU across four benchmark datasets and particularly shows clear advantages on images containing multiple categories, with the cost of only introducing a single auxiliary loss function and no additional parameters, validating the effectiveness of NeSy spatial reasoning in OVSS.
开放词汇语义分割(OVSS)从一个开放的类别集合中为像素分配标签,需要泛化到未见过且未标注的对象。仅使用视觉语言模型(VLM)将局部图像块与潜在的未见对象类别相关联,缺乏对场景中对象间空间关系的理解。为了解决此问题,我们在OVSS中引入了神经符号(NeSy)空间推理。与当代基于VLM相关性的方法不同,我们提出关系分割器(RelateSeg),通过在神经网络架构中表述的一阶逻辑(FOL)来施加显式的空间关系约束。这是OVSS中探索NeSy空间推理的首次尝试。具体来说,RelateSeg自动提取空间关系,例如"猫,在人的右边",并利用我们提出的伪类别将其编码为一阶逻辑公式。每个像素都学习同时预测语义类别(例如"猫")和空间伪类别(例如"人的右边"),从而强制执行关系约束(例如,"猫"像素必须位于"人"的右侧)。最后,这些逻辑约束通过模糊逻辑松弛在深度网络架构中表述,实现空间关系一致的端到端分割学习。RelateSeg在四个基准数据集上的平均mIoU达到了最新技术水平,特别是在包含多个类别的图像上显示出明显优势,而代价仅是引入一个辅助损失函数且无需添加额外参数,验证了NeSy空间推理在OVSS中的有效性。
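下面给出"一阶逻辑约束经模糊逻辑松弛后作为辅助损失"的一个简化示意:以乘积 t-范数近似蕴含关系"是猫 ⇒ 在人的右侧",并对逐像素概率求平均损失。具体松弛形式与索引约定均为假设,仅用于说明思路。

```python
import numpy as np

# 示意:把 FOL 约束 "cat ⇒ right-of(person)" 做模糊逻辑松弛后的辅助损失
# (简化假设:用乘积 t-范数;p_sem / p_rel 为 softmax 后的逐像素概率)

def relational_loss(p_sem: np.ndarray, p_rel: np.ndarray,
                    cat_idx: int, right_of_person_idx: int) -> float:
    """p_sem: [H, W, C_sem] 语义类别概率;p_rel: [H, W, C_rel] 空间伪类别概率。"""
    p_cat = p_sem[..., cat_idx]
    p_right = p_rel[..., right_of_person_idx]
    violation = p_cat * (1.0 - p_right)                      # "是猫却不在人的右侧" 的模糊真值
    return float(np.mean(-np.log(1.0 - violation + 1e-8)))   # 违反程度越高损失越大

# 用法示例(随机概率仅作演示)
H, W = 4, 4
p_sem = np.random.dirichlet(np.ones(3), size=(H, W))
p_rel = np.random.dirichlet(np.ones(2), size=(H, W))
print(relational_loss(p_sem, p_rel, cat_idx=0, right_of_person_idx=1))
```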
论文及项目相关链接
Summary
本文介绍了开放词汇语义分割(OVSS)中的神经符号(NeSy)空间推理技术。针对使用视觉语言模型(VLM)在场景中理解对象空间关系的问题,提出了关系分割器(RelateSeg)。RelateSeg通过神经网络架构施加显式空间关系约束,采用一阶逻辑(FOL)公式表达。这是首次尝试在OVSS中探索NeSy空间推理。RelateSeg自动提取空间关系,如“猫在人的右边”,并将其编码为一阶逻辑公式。每个像素同时预测语义类别(如“猫”)和空间伪类别(如“人的右边”),强制执行关系约束。最后,这些逻辑约束通过模糊逻辑松弛在深度网络架构中表达,实现了空间关系一致的端到端学习分割。RelateSeg在四个基准数据集上的平均mIoU达到最佳性能,特别是在包含多个类别的图像上显示出明显优势,只需引入一个辅助损失函数且无需额外参数。
Key Takeaways
- 开放词汇语义分割(OVSS)需要模型对未见过的对象进行泛化,并分配像素级标签。
- 当前使用视觉语言模型(VLM)的方法在理解场景中的对象空间关系方面存在缺陷。
- 引入神经符号(NeSy)空间推理来解决这一问题,通过神经网络架构施加显式空间关系约束。
- 关系分割器(RelateSeg)能自动提取并编码空间关系,如“猫在人的右边”。
- RelateSeg同时预测像素的语义类别和空间伪类别,强制执行关系约束。
- 通过模糊逻辑松弛表达逻辑约束,实现空间关系一致的端到端学习分割。
点此查看论文截图
Demo: Guide-RAG: Evidence-Driven Corpus Curation for Retrieval-Augmented Generation in Long COVID
Authors:Philip DiGiacomo, Haoyang Wang, Jinrui Fang, Yan Leng, W Michael Brode, Ying Ding
As AI chatbots gain adoption in clinical medicine, developing effective frameworks for complex, emerging diseases presents significant challenges. We developed and evaluated six Retrieval-Augmented Generation (RAG) corpus configurations for Long COVID (LC) clinical question answering, ranging from expert-curated sources to large-scale literature databases. Our evaluation employed an LLM-as-a-judge framework across faithfulness, relevance, and comprehensiveness metrics using LongCOVID-CQ, a novel dataset of expert-generated clinical questions. Our RAG corpus configuration combining clinical guidelines with high-quality systematic reviews consistently outperformed both narrow single-guideline approaches and large-scale literature databases. Our findings suggest that for emerging diseases, retrieval grounded in curated secondary reviews provides an optimal balance between narrow consensus documents and unfiltered primary literature, supporting clinical decision-making while avoiding information overload and oversimplified guidance. We propose Guide-RAG, a chatbot system and accompanying evaluation framework that integrates both curated expert knowledge and comprehensive literature databases to effectively answer LC clinical questions.
随着人工智能聊天机器人在临床医学中的应用越来越广泛,针对新兴复杂疾病开发有效的框架面临重大挑战。我们为Long COVID(LC)的临床问答开发了六种检索增强生成(RAG)语料库配置,从专家精心挑选的源到大规模文献数据库不等。我们的评估使用了一种大型语言模型作为法官的框架,根据忠实度、相关性和全面性的指标,并利用专家生成的临床问题新型数据集LongCOVID-CQ进行评价。我们的RAG语料库配置结合了临床指南和高质量的系统评价,持续优于狭隘的单指南方法和大规模文献数据库。我们的研究结果表明,对于新兴疾病而言,基于精选二次评价的检索在狭窄共识文件和未筛选的一次文献之间提供了最佳平衡,既支持临床决策制定,又避免了信息过载和过于简化的指导。我们提出了Guide-RAG这一聊天机器人系统及其评估框架,融合了精选的专家知识和全面的文献数据库,能够有效地回答LC的临床问题。
论文及项目相关链接
PDF Accepted to 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: The Second Workshop on GenAI for Health: Potential, Trust, and Policy Compliance
Summary:随着人工智能聊天机器人在临床医学中的广泛应用,针对新兴疾病的复杂问题构建有效的框架面临巨大挑战。针对Long COVID(LC)的临床问答需求,研究团队开发了六种Retrieval-Augmented Generation(RAG)语料库配置方案,包括专家精选来源及大规模文献数据库等。通过LongCOVID-CQ这一新型专家生成的临床问题数据集,研究团队评估了其在忠实度、相关性和综合度方面的表现。结合临床指南与高质量系统性审查的RAG语料库配置方案表现最佳,既优于单一指南方法,也优于大规模文献数据库方法。研究认为,针对新兴疾病,基于精选二次审查的检索可提供最佳平衡,既支持临床决策制定,又避免信息过载和简化指导。研究团队提出了Guide-RAG聊天机器人系统和评估框架,旨在整合专家知识和综合文献数据库,有效回答LC的临床问题。
Key Takeaways:
- AI聊天机器人在临床医学中面对新兴疾病的挑战在于如何构建有效的框架来应对复杂性。
- 研究团队为Long COVID(LC)的临床问答需求开发了六种RAG语料库配置方案。
- 结合临床指南与高质量系统性审查的RAG语料库配置方案表现最佳。
- Guide-RAG聊天机器人系统和评估框架旨在整合专家知识和综合文献数据库。
- 检索方法需平衡临床决策支持、信息过载和简化指导的需求。
- 通过LongCOVID-CQ数据集评估了忠实度、相关性和综合度三个方面的表现。
点此查看论文截图
HEADER: Hierarchical Robot Exploration via Attention-Based Deep Reinforcement Learning with Expert-Guided Reward
Authors:Yuhong Cao, Yizhuo Wang, Jingsong Liang, Shuhao Liao, Yifeng Zhang, Peizhuo Li, Guillaume Sartoretti
This work pushes the boundaries of learning-based methods in autonomous robot exploration in terms of environmental scale and exploration efficiency. We present HEADER, an attention-based reinforcement learning approach with hierarchical graphs for efficient exploration in large-scale environments. HEADER follows existing conventional methods to construct hierarchical representations for the robot belief/map, but further designs a novel community-based algorithm to construct and update a global graph, which remains fully incremental, shape-adaptive, and operates with linear complexity. Building upon attention-based networks, our planner finely reasons about the nearby belief within the local range while coarsely leveraging distant information at the global scale, enabling next-best-viewpoint decisions that consider multi-scale spatial dependencies. Beyond novel map representation, we introduce a parameter-free privileged reward that significantly improves model performance and produces near-optimal exploration behaviors, by avoiding training objective bias caused by handcrafted reward shaping. In simulated challenging, large-scale exploration scenarios, HEADER demonstrates better scalability than most existing learning and non-learning methods, while achieving a significant improvement in exploration efficiency (up to 20%) over state-of-the-art baselines. We also deploy HEADER on hardware and validate it in complex, large-scale real-life scenarios, including a 300m*230m campus environment.
本文在环境规模和探索效率方面推动了基于学习的自主机器人探索方法的边界。我们提出了HEADER,这是一种基于注意力机制的强化学习方法,使用分层图进行大规模环境中的高效探索。HEADER遵循现有的传统方法为机器人的信念/地图构建分层表示,但进一步设计了一种基于社区的新算法来构建和更新全局图,该算法保持完全增量、形状自适应,并以线性复杂度运行。我们的规划器基于注意力网络,能够精细推理局部范围内的附近信念,同时粗略利用全局尺度上的远距离信息,从而做出考虑多尺度空间依赖关系的下一最佳视点决策。除了新的地图表示之外,我们还引入了一种无参数的特权奖励,该奖励显著提高了模型性能,并产生了接近最优的探索行为,避免了人工设计的奖励塑形所造成的训练目标偏差。在模拟的具有挑战性的大规模探索场景中,HEADER展示了比大多数现有的学习和非学习方法更好的可扩展性,同时在探索效率上相比最先进的基线实现了显著提升(最高20%)。我们还将HEADER部署到硬件上,并在复杂的大规模真实场景中进行了验证,包括一个300米×230米的校园环境。
论文及项目相关链接
Summary
本文介绍了一种基于注意力强化学习的方法HEADER,用于大规模环境中的高效机器人自主探索。HEADER构建了一种基于社区的全局图算法,具有增量性、形状自适应和线性运算复杂度等特性。它通过注意力网络对近距离信息进行精细推理,同时对远距离信息进行粗略利用,考虑了多尺度空间依赖性。此外,引入了一种无参数的特权奖励,避免了人工设计的奖励塑形带来的训练目标偏差,显著提高了模型性能,实现了近最优的探索行为。在模拟的大规模探索场景中,HEADER表现出良好的可扩展性,相较于现有学习方法和非学习方法具有更高的探索效率(最高提升20%)。同时,在复杂的大规模真实场景中进行了硬件部署验证。
Key Takeaways
- HEADER是一种基于注意力强化学习的方法,用于大规模环境中的机器人自主探索。
- HEADER构建了一种基于社区的全局图算法,具有增量性、形状自适应和线性运算复杂度特性。
- 通过注意网络对近距离信息进行精细推理,同时对远距离信息进行粗略利用,考虑了多尺度空间依赖性。
- 引入了一种无参数的特权奖励,提高了模型性能并实现了近最优的探索行为。
- 在模拟的大规模探索场景中,HEADER相较于现有方法具有更高的探索效率。
- HEADER在复杂的大规模真实场景中具有实际应用价值。
点此查看论文截图
CarBoN: Calibrated Best-of-N Sampling Improves Test-time Reasoning
Authors:Yung-Chen Tang, Pin-Yu Chen, Andrea Cavallaro
Allocating more computation during inference time (test-time scaling) improves language model performance, especially for reasoning tasks. However, popular methods like Best-of-$N$ sampling often show diminishing returns as $N$ increases. To address this inefficiency, we introduce a general test-time calibration framework that adaptively modifies the model toward high-reward reasoning paths, with theoretical guarantees of improving the lower bound of expected reward under finite sampling, all without large language model (LLM) retraining. Within this framework, we propose CarBoN (Calibrated Best-of-$N$), a two-phase method that first explores the solution space and then learns a calibration of the logits via an input-specific temperature $T$ and additive shift vector $\delta$, guiding generation toward more reliable reasoning. Experiments on MATH-500 and AIME-2024 show that CarBoN improves efficiency, with up to $4\times$ fewer rollouts to reach the same accuracy, while often achieving higher accuracy under fixed budgets. We also analyze the complementary roles of $T$ and $\delta$ in balancing output diversity and correctness, and demonstrate that the framework also generalizes to step-level sampling strategies such as beam search. For more information, please refer to our project page at huggingface.co/spaces/TrustSafeAI/Test-Time-Calibration.
在推理时间(测试时间缩放)分配更多的计算资源可以提高语言模型性能,特别是对于推理任务。然而,如Best-of-$N$采样等流行方法往往随着$N$的增加而收益递减。为了解决这种低效问题,我们引入了一个通用的测试时间校准框架,该框架可以自适应地修改模型,使其朝向高回报的推理路径发展,并在有限采样的条件下,理论上保证提高预期回报的下限,所有这一切都不需要大规模语言模型(LLM)进行重新训练。在此框架下,我们提出了CarBoN(校准Best-of-$N$),这是一种两阶段的方法,首先探索解决方案空间,然后通过输入特定的温度$T$和添加偏移向量$\delta$来学习对数几率的校准,从而引导生成更可靠的推理。MATH-500和AIME-2024上的实验表明,CarBoN提高了效率,在达到相同准确率的情况下,滚动次数最多可减少$4\times$,同时在固定预算下往往实现更高的准确率。我们还分析了$T$和$\delta$在平衡输出多样性和正确性方面的互补作用,并证明该框架也适用于如集束搜索等步骤级采样策略。更多信息请参见我们的项目页面huggingface.co/spaces/TrustSafeAI/Test-Time-Calibration。
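下面的示意代码展示论文所述校准的核心形式:对 logits 施加输入相关的加性偏移 δ 与温度 T,再做 softmax 采样。T、δ 的取值与学习方式此处从略,仅演示校准后的生成接口。

```python
import numpy as np

# 示意:CarBoN 第二阶段的校准采样,对 logits 施加输入相关的温度 T 与加性偏移 δ
# (简化假设:T、delta 已在第一阶段探索中学得,这里仅演示如何用于生成)

def calibrated_probs(logits: np.ndarray, T: float, delta: np.ndarray) -> np.ndarray:
    z = (logits + delta) / T        # 先平移(引导向高奖励路径),再整体缩放多样性
    z = z - z.max()                 # 数值稳定
    p = np.exp(z)
    return p / p.sum()

def sample_next_token(logits, T, delta, rng=np.random.default_rng(0)):
    p = calibrated_probs(np.asarray(logits, dtype=float), np.asarray(delta, dtype=float).size and T or T,
                         np.asarray(delta, dtype=float)) if False else \
        calibrated_probs(np.asarray(logits, dtype=float), T, np.asarray(delta, dtype=float))
    return int(rng.choice(len(p), p=p))

vocab = 5
print(sample_next_token(logits=[2.0, 0.5, 0.1, -1.0, 0.0], T=0.8, delta=np.zeros(vocab)))
```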
论文及项目相关链接
摘要
文本展示了如何在推理时间分配更多计算资源以提高语言模型性能,特别是在推理任务方面的性能。针对现有方法如Best-of-$N$采样中随着$N$增大收益递减的问题,文本提出了一种通用的测试时校准框架。该框架可在无需对大型语言模型进行再训练的情况下,自适应地调整模型以朝向高回报的推理路径,并理论上保证在有限采样下提高预期收益的下界。在此基础上,文本提出了CarBoN(校准的Best-of-$N$)方法,这是一个两阶段的方法,首先探索解空间,然后通过输入特定的温度$T$和加法偏移向量$\delta$来学习对数几率的校准,引导生成更可靠的推理。实验显示,CarBoN能提高效率,在达到相同准确率的情况下,可减少高达$4\times$的滚动次数,同时在固定预算下往往能达到更高的准确率。此外,文本还分析了$T$和$\delta$在平衡输出多样性和正确性方面的互补作用,并证明该框架也可推广至如beam search等步骤级采样策略。
关键见解
- 测试时分配更多计算资源可提高语言模型在推理任务上的性能。
- 现有方法如Best-of-$N$采样存在随着$N$增大收益递减的问题。
- 提出了一种通用的测试时校准框架,可在无需对大型语言模型进行再训练的情况下提高预期收益。
- 介绍了CarBoN方法,通过两阶段过程探索解空间并对对数几率进行校准。
- CarBoN能提高效率,减少滚动次数,同时保持或提高准确率。
- $T$和$\delta$在平衡输出多样性和正确性方面发挥互补作用。
- 框架可推广至其他采样策略,如beam search。
点此查看论文截图
HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination
Authors:Tingting Chen, Beibei Lin, Zifeng Yuan, Qiran Zou, Hongyu He, Yew-Soon Ong, Anirudh Goyal, Dianbo Liu
As language models are increasingly used in scientific workflows, evaluating their ability to propose sets of explanations-not just a single correct answer-becomes critical. Many scientific problems are underdetermined: multiple, mechanistically distinct hypotheses are consistent with the same observations. We introduce HypoSpace, a diagnostic suite that treats LLMs as samplers of finite hypothesis sets and measures three complementary indicators: Validity (precision of proposals consistent with observations), Uniqueness (non-redundancy among proposals), and Recovery (coverage of the enumerated admissible set). We instantiate HypoSpace in three structured domains with deterministic validators and exactly enumerated hypothesis spaces: (i) causal graphs from perturbations, (ii) gravity-constrained 3D voxel reconstruction from top-down projections, and (iii) Boolean genetic interactions. Across instruction-tuned and reasoning-focused models, Validity often remains high while Uniqueness and Recovery degrade as the admissible space grows, revealing mode collapse that is invisible to correctness-only metrics. HypoSpace offers a controlled probe-rather than a leaderboard-for methods that explicitly explore and cover admissible explanation spaces. Code is available at: https://github.com/CTT-Pavilion/_HypoSpace.
随着语言模型在科研工作流程中得到越来越广泛的应用,评估它们提出一组解释(而非仅仅一个正确答案)的能力变得至关重要。许多科学问题是欠定的:多个机制上不同的假设都与同一组观测结果相符。我们引入了HypoSpace,这是一套诊断工具,它将大型语言模型视为有限假设集的采样器,并测量三个互补指标:有效性(与观测一致的提案所占比例)、唯一性(提案之间的非冗余性)和恢复性(对所枚举的可接受假设集合的覆盖率)。我们在三个具有确定性验证器且假设空间可被精确枚举的结构化领域实例化HypoSpace:(i)由扰动推断因果图,(ii)受重力约束、由俯视投影重建3D体素,以及(iii)布尔遗传相互作用。在指令微调和以推理为重点的模型中,随着可接受空间的增长,有效性往往保持较高,而唯一性和恢复性会下降,这揭示了仅凭正确性指标无法察觉的模式崩溃。HypoSpace为显式探索并覆盖可接受解释空间的方法提供了一个受控探针,而非排行榜。代码可在以下网址找到:https://github.com/CTT-Pavilion/_HypoSpace。
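三个指标本身很容易落到代码上。下面是一个最简实现的示意,假设每个假设都能规范化为可哈希的字符串,且可接受假设集合已被穷举:

```python
# 示意:HypoSpace 三项指标(有效性 / 唯一性 / 恢复性)的最简实现

def hypospace_metrics(proposals: list[str], admissible: set[str]) -> dict:
    """proposals: 模型提出的假设;admissible: 穷举出的全部可接受假设。"""
    if not proposals:
        return {"validity": 0.0, "uniqueness": 0.0, "recovery": 0.0}
    valid = [p for p in proposals if p in admissible]            # 与观测一致的提案
    return {
        "validity":   len(valid) / len(proposals),               # 提案精度
        "uniqueness": len(set(proposals)) / len(proposals),      # 提案间非冗余度
        "recovery":   len(set(valid)) / max(len(admissible), 1), # 对可接受集合的覆盖率
    }

print(hypospace_metrics(["h1", "h1", "h2", "h9"], admissible={"h1", "h2", "h3"}))
```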
论文及项目相关链接
Summary
语言模型在科学流程中的使用越来越普遍,评估它们提出多种解释集的能力(而不仅仅是单一的正确答案)变得至关重要。针对这种情况,我们推出了HypoSpace诊断工具套件。该套件将大型语言模型视为有限假设集的采样器,并测量三个互补指标:有效性(提案与观察结果的一致性程度)、唯一性(提案之间的非冗余性)和恢复性(对列举的可接受集合的覆盖率)。我们在三个结构化领域实例化了HypoSpace,这些领域具有确定性验证器和精确列举的假设空间:因果图扰动、受重力约束的三维体素重建从上到下投影以及布尔基因相互作用。在指令调优和注重推理的模型中,随着可行空间的增长,有效性往往保持高位,而唯一性和恢复性会下降,这揭示了模式崩溃的现象,这种现象在只关注正确性的指标中是看不见的。HypoSpace为明确探索和覆盖可行解释空间的方法提供了一个受控探针,而不是排行榜。相关代码已发布在:https://github.com/CTT-Pavilion/_HypoSpace。
Key Takeaways
- 语言模型在评估其提出多种解释集的能力方面变得重要,而不仅仅是给出单一正确答案。
- HypoSpace作为诊断工具套件,能够将大型语言模型视为假设集的采样器。
- 三个关键指标用于评估语言模型的性能:有效性、唯一性和恢复性。
- 在不同结构化领域中实例化了HypoSpace,包括因果图扰动、三维体素重建和布尔基因相互作用。
- 在可行空间增长的情况下,有效性保持高位,而唯一性和恢复性可能会下降,暴露出模式崩溃的问题。
- 与仅关注正确性的指标相比,HypoSpace提供了一个更全面的评估方法。
点此查看论文截图
Unleashing Scientific Reasoning for Bio-experimental Protocol Generation via Structured Component-based Reward Mechanism
Authors:Haoran Sun, Yankai Jiang, Zhenyu Tang, Yaning Pan, Shuang Gu, Zekai Lin, Lilong Wang, Wenjie Lou, Lei Liu, Lei Bai, Xiaosong Wang
The foundation of reproducible science lies in protocols that are precise, logically ordered, and executable. The autonomous generation of these protocols through natural language queries could greatly improve the efficiency of the reproduction process. However, current leading large language models (LLMs) often generate incomplete or inconsistent protocols, limiting their utility. To address this limitation, we first introduce SciRecipe, a large-scale dataset of over 12K structured protocols spanning 27 biological subfields and encompassing both comprehension and problem-solving tasks. To further improve protocol generation, we propose the “Sketch-and-Fill” paradigm, which separates analysis, structuring, and expression to ensure each step is explicit and verifiable. Complementing this, the structured component-based reward mechanism evaluates step granularity, action order, and semantic fidelity, aligning model optimization with experimental reliability. Building on these components, we develop Thoth, trained through a staged Knowledge-to-Action process that progresses from knowledge acquisition to operational reasoning and ultimately to robust, executable protocol generation. Across multiple benchmarks, Thoth consistently surpasses both proprietary and open-source LLMs, achieving significant improvements in step alignment, logical sequencing, and semantic accuracy. Our approach paves the way for reliable scientific assistants that bridge knowledge with experimental execution. All data, code, and models will be released publicly.
可重复科学的基石在于精确、逻辑有序且可执行的协议。通过自然语言查询自主生成这些协议可以大大提高再现过程的效率。然而,当前主流的大型语言模型(LLM)经常生成不完整或不一致的协议,限制了其效用。为了解决这一局限性,我们首先引入了SciRecipe,这是一个包含超过12K个结构化协议的大规模数据集,涵盖27个生物学子领域,涵盖理解和解决问题两种任务。为了进一步优化协议生成,我们提出了“草图填充”范式,它将分析、结构和表达分离,以确保每个步骤都是明确且可验证的。作为补充,结构化组件基础上的奖励机制评估步骤粒度、动作顺序和语义保真度,使模型优化与实验可靠性相一致。基于这些组件,我们开发了托特(Thoth),通过分阶段的知识到行动过程进行训练,从知识获取进步到操作推理,最终实现稳健、可执行的协议生成。在多个基准测试中,Thoth在步骤对齐、逻辑顺序和语义准确性方面始终超过了专有和开源的LLM,取得了显著的改进。我们的方法为可靠的科研助手铺平了道路,这些助手能够将知识与实验执行联系起来。所有数据、代码和模型都将公开发布。
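下面给出"基于结构化组件的奖励"的一种简化示意:先把生成协议的步骤与参考步骤做匹配,再分别从步骤覆盖(粒度)、动作顺序和语义保真度三方面打分并加权。其中用词重叠近似语义相似度、各权重与阈值均为演示性假设。

```python
# 示意:结构化组件奖励,从步骤粒度 / 动作顺序 / 语义保真度三方面为生成协议打分(简化假设)

def overlap(a: str, b: str) -> float:
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

def protocol_reward(gen_steps: list[str], ref_steps: list[str],
                    w=(0.4, 0.3, 0.3), thr=0.3) -> float:
    # 1) 将每个参考步骤匹配到最相似的生成步骤
    matches = []
    for i, ref in enumerate(ref_steps):
        scores = [overlap(ref, g) for g in gen_steps]
        j = max(range(len(gen_steps)), key=lambda k: scores[k]) if gen_steps else -1
        if j >= 0 and scores[j] >= thr:
            matches.append((i, j, scores[j]))
    if not matches:
        return 0.0
    granularity = len(matches) / len(ref_steps)                   # 步骤覆盖(粒度)
    fidelity = sum(s for _, _, s in matches) / len(matches)       # 语义保真度
    # 2) 动作顺序:匹配到的生成步骤序号应随参考顺序单调递增
    js = [j for _, j, _ in matches]
    pairs = list(zip(js, js[1:]))
    order = sum(a <= b for a, b in pairs) / len(pairs) if pairs else 1.0
    return w[0] * granularity + w[1] * order + w[2] * fidelity

gen = ["称取 5 g 样品", "加入缓冲液并混匀", "37 度孵育 30 分钟"]
ref = ["称取 5 g 样品", "加入 10 mL 缓冲液混匀", "置于 37 度孵育 30 分钟"]
print(round(protocol_reward(gen, ref), 3))
```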
论文及项目相关链接
Summary
本文介绍了科学协议的可重复性基础,以及通过自然语言查询自主生成这些协议的重要性。针对当前大型语言模型在生成完整、一致协议方面的局限性,提出了SciRecipe数据集和“Sketch-and-Fill”范式,以及结构化组件奖励机制。在此基础上,开发了Thoth,它在多个基准测试中表现优异,实现了步骤对齐、逻辑排序和语义准确性的显著提高。该研究为可靠的科学助手的发展铺平了道路,将知识与实验执行相结合。
Key Takeaways
- 协议的可重复性对于科学研究至关重要,需要精确、逻辑有序和执行性。
- 当前大型语言模型在生成完整和一致的科学协议方面存在局限性。
- SciRecipe数据集是一个包含超过12K结构化协议的大规模数据集,涵盖27个生物学子领域,包含理解和解决问题的任务。
- “Sketch-and-Fill”范式通过分离分析、结构和表达,确保每一步明确且可验证。
- 结构化组件奖励机制评估步骤粒度、动作顺序和语义保真度,使模型优化与实验可靠性相一致。
- Thoth通过分阶段的知识到行动过程进行训练,从知识获取到操作推理,最终实现稳健、可执行的协议生成。
点此查看论文截图
JudgeSQL: Reasoning over SQL Candidates with Weighted Consensus Tournament
Authors:Jiayuan Bai, Xuan-guang Pan, Chongyang Tao, Shuai Ma
Text-to-SQL is a pivotal task that bridges natural language understanding and structured data access, yet it remains fundamentally challenging due to semantic ambiguity and complex compositional reasoning. While large language models (LLMs) have greatly advanced SQL generation though prompting, supervised finetuning and reinforced tuning, the shift toward test-time scaling exposes a new bottleneck: selecting the correct query from a diverse candidate pool. Existing selection approaches, such as self-consistency or best-of-$N$ decoding, provide only shallow signals, making them prone to inconsistent scoring, fragile reasoning chains, and a failure to capture fine-grained semantic distinctions between closely related SQL candidates. To this end, we introduce JudgeSQL, a principled framework that redefines SQL candidate selection through structured reasoning and weighted consensus tournament mechanism. JudgeSQL develops a reasoning-based SQL judge model that distills reasoning traces with reinforcement learning guided by verifiable rewards, enabling accurate and interpretable judgments. Building on this, a weighted consensus tournament integrates explicit reasoning preferences with implicit generator confidence, yielding selections that are both more reliable and more efficient. Extensive experiments on the BIRD benchmark demonstrate that JudgeSQL exhibits superior SQL judgment capabilities and good cross-scale generalization and robustness to generator capacity.
文本到SQL是一项连接自然语言理解和结构化数据访问的关键任务,但由于语义模糊和复杂的组合推理,它仍然具有根本性的挑战。虽然大型语言模型(LLM)通过提示、监督微调和强化微调等方式极大地推动了SQL生成的发展,但向测试时扩展的转变暴露了一个新的瓶颈:从多样化的候选池中选出正确的查询。现有的选择方法,如自我一致性或最佳N解码,只提供浅层信号,容易出现评分不一致、推理链脆弱,并且无法捕捉紧密相关的SQL候选之间的细微语义差异。为此,我们引入了JudgeSQL,这是一个通过结构化推理和加权共识锦标赛机制重新定义SQL候选选择的原则性框架。JudgeSQL开发了一个基于推理的SQL判断模型,借助由可验证奖励引导的强化学习来提炼推理轨迹,从而实现准确且可解释的判断。在此基础上,加权共识锦标赛将显式的推理偏好与隐式的生成器置信度相结合,产生既更可靠又更高效的选择。在BIRD基准测试上的大量实验表明,JudgeSQL表现出卓越的SQL判断能力,并具有良好的跨规模泛化能力和对生成器容量的稳健性。
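下面的示意代码展示"加权共识锦标赛"的一种可能形式:评审模型对候选 SQL 做两两比较,累计偏好得分后再与生成器置信度加权融合。`judge_prefer` 与权重 α 均为假设的占位。

```python
# 示意:加权共识锦标赛选择 SQL 候选(简化假设,judge_prefer 为占位的评审模型调用)
from collections import defaultdict
from itertools import combinations

def judge_prefer(question: str, sql_a: str, sql_b: str) -> float:
    """占位:假设返回评审模型认为 a 优于 b 的概率(0~1)。"""
    return 0.5   # 实际应由基于推理的 SQL 评审模型给出

def consensus_select(question: str, candidates: list[str],
                     gen_conf: dict, alpha: float = 0.7) -> str:
    """alpha 权衡显式推理偏好与隐式生成器置信度(示意性设定)。"""
    score = defaultdict(float)
    for a, b in combinations(candidates, 2):       # 两两"对战",累计偏好
        p = judge_prefer(question, a, b)
        score[a] += p
        score[b] += 1.0 - p
    n = max(len(candidates) - 1, 1)
    return max(candidates,
               key=lambda c: alpha * score[c] / n + (1 - alpha) * gen_conf.get(c, 0.0))

cands = ["SELECT name FROM users", "SELECT * FROM users"]
print(consensus_select("列出用户姓名", cands, gen_conf={cands[0]: 0.7, cands[1]: 0.3}))
```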
论文及项目相关链接
PDF 13 pages
摘要
文本转SQL是一项连接自然语言理解与结构化数据访问的重要任务,但仍存在语义模糊和复杂组合推理等挑战。尽管大型语言模型(LLM)通过提示、监督微调及强化训练等手段大大推进了SQL生成,但在测试时的规模化转移却暴露出新的问题:如何从多样化的候选查询中选择正确的查询。现有选择方法如自我一致性或最佳N解码等仅提供浅层信号,易出现评分不一致、推理链脆弱以及无法区分相近SQL候选的精细语义。为解决此问题,我们推出JudgeSQL,一个通过结构化推理和加权共识锦标赛机制重新定义SQL候选选择的框架。JudgeSQL发展了一种基于推理的SQL判断模型,通过强化学习引导可验证奖励进行推理轨迹提炼,实现准确且可解释的判断。在此基础上,加权共识锦标赛整合了明确的推理偏好与隐式的生成器信心,产生既更可靠又更高效的选择。在BIRD基准测试上的广泛实验表明,JudgeSQL展现出出色的SQL判断能力,具有良好的跨尺度泛化能力和对生成器容量的稳健性。
关键见解
- Text-to-SQL任务面临语义模糊和复杂组合推理的挑战。
- 大型语言模型在SQL生成方面取得显著进展,但测试时从多样化候选查询中选择正确查询成为新瓶颈。
- 现有选择方法如自我一致性和最佳N解码存在局限性,易出现评分不一致、推理链脆弱等问题。
- JudgeSQL框架通过结构化推理和加权共识锦标赛机制重新定义了SQL候选选择。
- JudgeSQL借助由可验证奖励引导的强化学习来提炼推理轨迹,实现准确且可解释的判断。
- 加权共识锦标赛整合了明确的推理偏好与隐式的生成器信心,提升选择的可靠性和效率。
点此查看论文截图
Latent Reasoning in LLMs as a Vocabulary-Space Superposition
Authors:Jingcheng Deng, Liang Pang, Zihao Wei, Shichen Xu, Zenghao Duan, Kun Xu, Yang Song, Huawei Shen, Xueqi Cheng
Large language models (LLMs) demonstrate strong reasoning abilities with chain-of-thought prompting, but explicit reasoning introduces substantial computational overhead. Recent work on latent reasoning reduces this cost by reasoning in latent space without explicit supervision, but performance drops significantly. Our preliminary experiments suggest that this degradation stems from the unstructured latent space, which makes fitting latent tokens difficult. To address this, we restrict the latent space to the column space of the LLM vocabulary, treating latent reasoning as a superposition over vocabulary probabilities. Once latent reasoning concludes, it collapses into an eigenstate of explicit reasoning to yield the final answer. Based on this idea, we propose Latent-SFT, a two-stage learning framework. In the first stage, we design two specialized attention masks to guide the Latent Token Encoder in generating latent tokens, allowing the LLM to produce the correct answer conditioned on them. In the second stage, the Latent Token Encoder is discarded, and the LLM is directly trained to generate these latent tokens autonomously for latent reasoning, optimized with KL and CE losses. Latent-SFT sets a new state of the art on GSM8k, matching explicit SFT performance while cutting reasoning chains by up to 4 times and outperforming prior latent methods. On Math500 and AIME24, lexical probability-based latent reasoning also clearly surpasses hidden-state-based approaches. Our metrics of effective compression rate and effective global parallelism further show that latent reasoning is both the compression of a single path and the superposition of multiple paths.
大型语言模型(LLM)通过链式思维提示展现了强大的推理能力,但显式推理引入了大量的计算开销。关于潜在推理的最新工作通过潜在空间中的推理降低了这一成本,无需明确的监督,但性能会显著下降。我们的初步实验表明,这种性能下降源于无结构的潜在空间,这使得拟合潜在令牌变得困难。为了解决这一问题,我们将潜在空间限制在LLM词汇表的列空间内,将潜在推理视为词汇概率上的叠加。潜在推理结束后,它会塌缩成显式推理的本征态以得出最终答案。基于这一想法,我们提出了Latent-SFT,一个两阶段学习框架。在第一阶段,我们设计两种专用注意力掩码来指导潜在令牌编码器生成潜在令牌,使LLM能够在它们的条件下产生正确答案。在第二阶段,丢弃潜在令牌编码器,直接训练LLM自主生成这些潜在令牌进行潜在推理,通过KL和CE损失进行优化。Latent-SFT在GSM8k上树立了新的业界标杆,匹配显式SFT的性能,同时减少推理链高达4倍,并超越了先前的潜在方法。在Math500和AIME24上,基于词汇概率的潜在推理也明显超越了基于隐藏状态的方法。我们的有效压缩率和有效全局并行性指标进一步表明,潜在推理既是单一路径的压缩,又是多条路径的叠加。
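下面用一个玩具规模的 NumPy 例子演示"潜在令牌 = 词表概率的叠加态"这一核心想法:叠加态的嵌入取词表嵌入的概率加权和(因而落在词表的列空间内),推理结束时坍缩为 argmax 对应的显式令牌。规模与数值均为演示性假设。

```python
import numpy as np

# 示意:潜在令牌表示为词表概率的叠加态,其嵌入为词表嵌入的概率加权和;
# 推理结束时"坍缩"为 argmax 的显式令牌(仅为对论文思想的简化演示)

rng = np.random.default_rng(0)
V, d = 6, 4                       # 玩具规模:词表大小 V,嵌入维度 d
E = rng.normal(size=(V, d))       # 词表嵌入矩阵

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

logits = rng.normal(size=V)
p = softmax(logits)               # 潜在令牌 = 词表概率上的叠加
latent_embedding = p @ E          # 叠加态的嵌入:限制在词表嵌入张成的空间内
explicit_token = int(p.argmax())  # 坍缩为显式推理的"本征态"
print(p.round(3), explicit_token)
```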
论文及项目相关链接
Summary
大型语言模型(LLM)通过链式思维提示展现出强大的推理能力,但显式推理引入了大量计算开销。近期关于潜在推理的研究旨在降低这一成本,通过在潜在空间中进行推理而无需显式监督。然而,性能显著下降。初步实验表明,性能下降源于非结构化的潜在空间,使得拟合潜在令牌变得困难。为解决这一问题,本文提出将潜在空间限制在LLM词汇表的列空间内,将潜在推理视为词汇概率上的叠加。潜在推理结束后,它会转化为显式推理的本征态以得出最终答案。基于此理念,本文提出了Latent-SFT这一两阶段学习框架。第一阶段设计两种专用注意力掩码,以指导潜在令牌编码器生成潜在令牌,使LLM能够在它们的条件下给出正确答案。第二阶段则抛弃潜在令牌编码器,直接训练LLM自主生成这些潜在令牌以进行潜在推理,并用KL和CE损失进行优化。Latent-SFT在GSM8k上达到了最新水平的技术成就,匹配显式SFT的性能,同时减少推理链高达四次并超越先前的潜在方法。此外,基于词汇概率的潜在推理也在Math500和AIME24上超越了基于隐藏状态的方法。我们的有效压缩率和全局并行性的指标进一步表明,潜在推理不仅是单一路径的压缩,而且是多个路径的叠加。
Key Takeaways
- 大型语言模型(LLM)具备强大的推理能力,但显式推理计算开销大。
- 潜在推理旨在降低计算成本,但性能可能显著下降。
- 初步实验表明性能下降源于非结构化的潜在空间。
- 提出将潜在空间限制在LLM词汇表的列空间内的方法来解决这一问题。
- 引入Latent-SFT两阶段学习框架来指导生成潜在令牌并训练LLM自主进行潜在推理。
- Latent-SFT在多个数据集上达到最新技术水平,有效减少推理链并超越先前的潜在方法。
点此查看论文截图
Latent Feature Alignment: Discovering Biased and Interpretable Subpopulations in Face Recognition Models
Authors:Ignacio Serna
Modern face recognition models achieve high overall accuracy but continue to exhibit systematic biases that disproportionately affect certain subpopulations. Conventional bias evaluation frameworks rely on labeled attributes to form subpopulations, which are expensive to obtain and limited to predefined categories. We introduce Latent Feature Alignment (LFA), an attribute-label-free algorithm that uses latent directions to identify subpopulations. This yields two main benefits over standard clustering: (i) semantically coherent grouping, where faces sharing common attributes are grouped together more reliably than by proximity-based methods, and (ii) discovery of interpretable directions, which correspond to semantic attributes such as age, ethnicity, or attire. Across four state-of-the-art recognition models (ArcFace, CosFace, ElasticFace, PartialFC) and two benchmarks (RFW, CelebA), LFA consistently outperforms k-means and nearest-neighbor search in intra-group semantic coherence, while uncovering interpretable latent directions aligned with demographic and contextual attributes. These results position LFA as a practical method for representation auditing of face recognition models, enabling practitioners to identify and interpret biased subpopulations without predefined attribute annotations.
现代人脸识别模型虽然总体上具有很高的准确性,但仍然存在系统偏见,对某些特定群体造成不公平的影响。传统的偏见评估框架依赖于标签属性来形成子群体,这些标签属性的获取成本高昂且仅限于预定义的类别。我们引入了潜在特征对齐(LFA)算法,这是一种无需属性标签的算法,它通过潜在方向来识别子群体。这相对于标准聚类方法带来了两个主要优势:(i)语义连贯的分组,其中共享共同属性的面孔比基于邻近度的方法更可靠地组合在一起;(ii)发现可解释的潜在方向,这些方向对应于年龄、种族或服饰等语义属性。在四种最新的人脸识别模型(ArcFace、CosFace、ElasticFace、PartialFC)和两个基准测试(RFW、CelebA)上,LFA在组内语义一致性方面始终优于K均值和最近邻搜索,同时揭示了与人口统计和上下文属性相对应的可解释的潜在方向。这些结果使LFA成为人脸识别模型表示审计的实用方法,使从业者能够识别和解释存在偏见的子群体,无需预先定义的属性注释。
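下面是一个概念性示意:沿某个潜在方向把人脸嵌入划分成子群体,并用组内平均余弦相似度衡量语义连贯性。这里以第一主成分充当"潜在方向"仅作占位,LFA 实际的方向发现方式可能不同。

```python
import numpy as np

# 示意:沿潜在方向划分子群体,并比较组内连贯性(第一主成分仅为占位的"潜在方向")

def normalize(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def split_by_direction(emb: np.ndarray, top_q: float = 0.2):
    emb = normalize(emb)
    centered = emb - emb.mean(0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]                      # 占位:充当一个潜在方向
    proj = centered @ direction
    k = max(int(len(emb) * top_q), 1)
    idx = np.argsort(proj)
    return idx[-k:], idx[:k]               # 方向两端的两个子群体

def coherence(emb: np.ndarray, idx: np.ndarray) -> float:
    g = normalize(emb[idx])
    sim = g @ g.T
    return float((sim.sum() - len(idx)) / (len(idx) * (len(idx) - 1) + 1e-8))

emb = np.random.default_rng(0).normal(size=(100, 32))   # 假想的人脸嵌入
hi, lo = split_by_direction(emb)
print(coherence(emb, hi), coherence(emb, lo))
```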
论文及项目相关链接
Summary
人脸识别模型整体准确率高,但存在影响特定群体的系统性偏见。传统偏见评估框架依赖于标签属性来形成子群体,这既昂贵又局限于预设类别。我们引入无属性标签的潜在特征对齐(LFA)算法,利用潜在方向来识别子群体,具有语义连贯分组和可解释的潜在方向发现两大优势。在四个先进的人脸识别模型和两个基准测试中,LFA在组内语义连贯性方面始终优于K均值和最近邻搜索,同时发现与人口统计学和上下文属性对齐的可解释潜在方向。这使得LFA成为人脸识别模型表示审计的实用方法,使从业者能够在没有预先设定的属性注释的情况下识别和解释有偏见的子群体。
Key Takeaways
- 现代人脸识别模型虽然总体准确率高,但仍存在影响特定群体的系统性偏见。
- 传统偏见评估框架依赖于昂贵的标签属性来形成子群体,且局限于预设类别。
- 引入的潜在特征对齐(LFA)算法无需属性标签,利用潜在方向识别子群体。
- LFA具有语义连贯分组和发现与人口统计学及上下文属性对齐的可解释潜在方向两大优势。
- LFA在多个先进人脸识别模型和基准测试中,表现出优秀的组内语义连贯性。
- LFA为人脸识别模型的表示审计提供了实用方法。
点此查看论文截图
The Road Less Traveled: Enhancing Exploration in LLMs via Sequential Sampling
Authors:Shijia Kang, Muhan Zhang
Reinforcement learning (RL) has been pivotal in enhancing the reasoning capabilities of large language models (LLMs), but it often suffers from limited exploration and entropy collapse, where models exploit a narrow set of solutions, leading to a loss of sampling diversity and subsequently preventing RL from further improving performance. This issue is exacerbated in parallel sampling methods, where multiple outputs are drawn from the same distribution, potentially causing the model to converge to similar solutions. We propose SESA, a novel SEquential SAmpling framework that mitigates this challenge by generating diverse solution sketches sequentially before expanding them into full reasoning paths. This approach ensures broader exploration by conditioning each new output on previous ones, promoting diversity throughout the process and preventing policy collapse. Our experiments on a synthetic task show that sequential sampling consistently outperforms traditional RL methods in terms of path diversity and recovery from collapse. Further evaluations on real-world tasks demonstrate that SESA improves both the exploration of valid strategies and the overall performance of LLMs. On three agent benchmarks, SESA lifts success rates by $+0.25$, $+0.42$, and $+0.07$ absolute over the base model (up to an additional $211%$ relative improvement over baseline RL), underscoring its exploration advantage. This work introduces a structured approach to exploration, paving the way for more effective and diverse reasoning in RL-trained LLMs. Our code is released at https://github.com/MuLabPKU/sesa.
强化学习(RL)在提升大型语言模型(LLM)的推理能力方面发挥着至关重要的作用,但往往受到探索有限和熵崩溃的限制,模型会利用一组有限的解决方案,导致采样多样性丧失,从而阻碍RL进一步提高性能。在并行采样方法中,这个问题更加严重,因为从同一分布中抽取多个输出可能导致模型收敛到相似的解决方案。我们提出了SESA,这是一种新型的SEquential SAmpling框架,通过按顺序生成不同的解决方案草图,然后再将其扩展为完整的推理路径,从而缓解这一挑战。这种方法通过使每个新输出依赖于前一个输出,确保更广泛的探索,促进过程中的多样性,并防止策略崩溃。我们在合成任务上的实验表明,在路径多样性和从崩溃中恢复方面,顺序采样始终优于传统RL方法。在现实任务上的进一步评估表明,SESA提高了有效策略的探索以及LLM的总体性能。在三个代理基准测试中,SESA在基础模型上绝对提高了+0.25,+0.42和+0.07的成功率(相对于基线RL,最高达到额外的211%的相对改进),突显了其探索优势。这项工作引入了结构化探索方法,为RL训练LLM中更有效和多样化的推理铺平了道路。我们的代码发布在https://github.com/MuLabPKU/sesa。
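下面是顺序采样的一个最小示意:每个新的解法草图都以之前生成的草图为条件,从而显式地鼓励多样性;`generate` 为占位的 LLM 调用,提示词模板只是假设。

```python
# 示意:SESA 风格的顺序采样,先依次生成互不相同的解法草图,再展开成完整推理

def generate(prompt: str) -> str:
    """占位:假设调用一次 LLM 并返回文本。"""
    return "<sketch>"

def sequential_sketches(question: str, n: int = 4) -> list[str]:
    sketches: list[str] = []
    for _ in range(n):
        context = "\n".join(f"已有思路 {i+1}: {s}" for i, s in enumerate(sketches))
        prompt = (f"问题:{question}\n{context}\n"
                  "请提出一个与上述思路不同的新解法草图:")
        sketches.append(generate(prompt))    # 每个新输出都以之前的输出为条件
    return sketches

def expand(question: str, sketch: str) -> str:
    return generate(f"问题:{question}\n按照以下思路给出完整推理:{sketch}")
```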
论文及项目相关链接
Summary
强化学习(RL)在提高大型语言模型(LLM)的推理能力方面起着至关重要的作用,但存在探索有限和熵崩溃的问题。针对这一问题,我们提出了SESA,一种新型的SEquential SAmpling框架,通过顺序生成多样的解决方案草图,再扩展成完整的推理路径,解决了这个问题。实验表明,该框架在路径多样性和防止策略崩溃方面表现优越。该工作在RL训练的LLM中引入结构化探索方法,显著提升模型性能。
Key Takeaways
- 强化学习在提升大型语言模型的推理能力方面非常重要。
- 现有强化学习方法存在探索有限和熵崩溃的问题。
- SESA框架通过顺序采样生成多样的解决方案草图来解决这一问题。
- SESA框架通过条件化新输出在之前的输出上,促进了过程中的多样性并防止了策略崩溃。
- 实验表明,SESA框架在路径多样性和防止策略崩溃方面表现优越。
- SESA框架在真实任务上提升了LLM的探索有效策略和总体性能。
点此查看论文截图
HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment
Authors:Yuexiao Liu, Lijun Li, Xingjun Wang, Jing Shao
Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have gained significant attention due to their objective and verifiable reward signals, demonstrating strong performance in reasoning and code generation tasks. However, the potential safety risks associated with RLVR remain underexplored. This paper presents HarmRLVR, the first systematic investigation into the alignment reversibility risk of RLVR. We show that safety alignment can be rapidly reversed using GRPO with merely 64 harmful prompts without responses, causing models to readily comply with harmful instructions. Across five models from Llama, Qwen, and DeepSeek, we empirically demonstrate that RLVR-based attacks elevate the average harmfulness score to 4.94 with an attack success rate of 96.01%, significantly outperforming harmful fine-tuning while preserving general capabilities. Our findings reveal that RLVR can be efficiently exploited for harmful alignment, posing serious threats to open-source model safety. Please see our code at https://github.com/lyxx2535/HarmRLVR.
最近,强化学习与可验证奖励(RLVR)的最新进展因其客观和可验证的奖励信号而受到广泛关注,它在推理和代码生成任务中表现出强大的性能。然而,与RLVR相关的潜在安全风险尚未得到充分探索。本文提出了HarmRLVR,这是关于RLVR对齐可逆性风险的首个系统研究。我们表明,仅使用64个有害提示(无需响应)即可迅速逆转安全对齐,使模型易于遵循有害指令。在Llama、Qwen和DeepSeek的五个模型中,我们通过实证表明,基于RLVR的攻击将平均有害性得分提高到4.94,攻击成功率为96.01%,在保持一般能力的同时,显著优于有害微调。我们的研究发现,RLVR可以被有效地用于有害对齐,对开源模型的安全构成严重威胁。请查看我们的代码:https://github.com/lyxx2535/HarmRLVR。
论文及项目相关链接
Summary
强化学习与可验证奖励(RLVR)的最新进展因其客观和可验证的奖励信号在推理和代码生成任务中表现出强大的性能而受到广泛关注,但其潜在的安全风险尚未得到充分探索。本文首次系统探讨了RLVR的对齐可逆性风险。研究表明,仅使用64个有害提示而无响应,即可迅速逆转安全对齐,使模型易于遵循有害指令。在来自Llama、Qwen和DeepSeek的五个模型上,我们实证表明,基于RLVR的攻击将平均有害性得分提高到4.94,攻击成功率为96.01%,在保持一般能力的同时,显著优于有害微调。研究发现RLVR易于受到有害对齐的利用,对开源模型的安全构成严重威胁。
Key Takeaways
- RLVR技术因其客观和可验证的奖励信号在多个任务中表现出强大的性能。
- RLVR存在潜在的安全风险,特别是对齐可逆性风险。
- 使用GRPO方法,仅通过64个有害提示无响应,即可迅速逆转模型的安全对齐。
- 基于RLVR的攻击使模型易于遵循有害指令,攻击成功率高。
- 在五个不同模型上的实证研究表明,RLVR攻击的有害性得分显著高于其他方法。
- RLVR技术易于受到利用,对开源模型的安全构成严重威胁。
点此查看论文截图
A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning
Authors:Zhi Zhou, Yuhao Tan, Zenan Li, Yuan Yao, Lan-Zhe Guo, Yu-Feng Li, Xiaoxing Ma
Test-time scaling seeks to improve the reasoning performance of large language models (LLMs) by adding computational resources. A prevalent approach within the field is sampling-based test-time scaling methods, which enhance reasoning by generating multiple reasoning paths for a given input during inference. However, despite its practical success, the theoretical foundations remain underexplored. In this paper, we provide the first theoretical framework for analyzing sampling-based test-time scaling methods, grounded in the perspective of confidence estimation. Based on the framework, we analyze two dominant paradigms: self-consistency and perplexity, and reveal key limitations: self-consistency suffers from high estimation error while perplexity exhibits substantial modeling error and possible degradation of the estimation error convergence. To address these limitations, we introduce RPC, a hybrid method that leverages our theoretical insights through two key components: Perplexity Consistency and Reasoning Pruning. Perplexity Consistency combines the strengths of self-consistency and perplexity, boosting the convergence rate of estimation error from linear to exponential while preserving model error. Reasoning Pruning prevents degradation by eliminating low-probability reasoning paths. Both theoretical analysis and empirical results across seven benchmark datasets demonstrate that RPC has a strong potential for reducing reasoning error. Notably, RPC achieves reasoning performance comparable to self-consistency while not only enhancing confidence reliability but also reducing sampling costs by 50%. The code and resources are available at https://wnjxyk.github.io/RPC.
测试时缩放旨在通过增加计算资源来提高大型语言模型(LLM)的推理性能。该领域的常见方法是基于采样的测试时缩放方法,通过在推理过程中为给定输入生成多个推理路径来增强推理能力。然而,尽管其实践成功,但其理论基础尚未得到充分探索。在本文中,我们提供了基于置信估计角度分析基于采样的测试时缩放方法的首个理论框架。基于该框架,我们分析了两种主要范式:自我一致性(self-consistency)和困惑度(perplexity),揭示了其主要局限性:自我一致性存在高估计误差,而困惑度存在显著的建模误差和可能的估计误差收敛退化。为了解决这些局限性,我们引入了RPC(一种混合方法),它通过两个关键组成部分利用我们的理论见解:困惑一致性(Perplexity Consistency)和推理修剪(Reasoning Pruning)。困惑一致性结合了自我一致性和困惑度的优点,提高了估计误差的收敛率,从线性到指数级提高,同时保持模型误差。推理修剪通过消除低概率推理路径来防止退化。在七个基准数据集上的理论分析和实验结果均表明,RPC在降低推理误差方面具有很强的潜力。值得注意的是,RPC实现了与自我一致性相当的推理性能,不仅提高了置信可靠性,还将采样成本降低了50%。相关代码和资源可在https://wnjxyk.github.io/RPC访问。
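下面的示意代码把 RPC 的两个组成部分放在一起:先按序列对数概率做"推理剪枝",再把保留路径的归一化概率按最终答案聚合(困惑度一致性),并取置信度最高的答案。剪枝比例等超参数为演示性假设。

```python
import numpy as np
from collections import defaultdict

# 示意:推理剪枝(去掉低概率路径)+ 困惑度一致性(按答案聚合归一化概率)

def rpc_select(answers: list[str], logprobs: list[float], prune_ratio: float = 0.5):
    lp = np.asarray(logprobs, dtype=float)
    keep = lp >= np.quantile(lp, prune_ratio)      # 推理剪枝:丢弃低概率推理路径
    p = np.exp(lp[keep] - lp[keep].max())
    p = p / p.sum()                                # 保留路径上的归一化概率
    conf = defaultdict(float)
    for ans, w in zip([a for a, k in zip(answers, keep) if k], p):
        conf[ans] += w                             # 困惑度一致性:同一答案的概率叠加
    best = max(conf, key=conf.get)
    return best, dict(conf)

print(rpc_select(["42", "42", "41", "42"], [-3.0, -2.5, -9.0, -4.0]))
```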
论文及项目相关链接
PDF Accepted by NeurIPS 2025
Summary
本文探讨了测试时缩放(Test-time scaling)在大型语言模型(LLMs)中的推理性能改进。特别是基于采样的测试时缩放方法,通过生成多个推理路径增强推理能力。文章首次从信心估计的角度为基于采样的测试时缩放方法提供了理论框架,并分析了两种主要方法:自我一致性(self-consistency)和困惑度(perplexity)的局限性。为了解决这些局限性,文章引入了一种名为RPC的混合方法,它通过困惑一致性(Perplexity Consistency)和推理修剪(Reasoning Pruning)两个关键组件来提升性能。RPC在七个基准数据集上的理论和实验结果证明了其降低推理错误的潜力。此外,RPC可实现与自我一致性相当的推理性能,同时提高了信心可靠性并降低了采样成本。
Key Takeaways
- 测试时缩放旨在通过增加计算资源提高大型语言模型的推理性能。
- 基于采样的测试时缩放方法通过生成多个推理路径增强推理能力。
- 文章首次提供了基于采样的测试时缩放方法的理论框架,从信心估计的角度进行分析。
- 自我一致性和困惑度是两种主要的分析范式,但它们存在局限性。
- RPC方法通过困惑一致性和推理修剪解决了自我一致性和困惑度的局限性。
- RPC在多个基准数据集上表现出降低推理错误的潜力,并实现了与自我一致性相当的推理性能,同时提高了信心可靠性并将采样成本降低了50%。
点此查看论文截图
Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning
Authors:Xuchen Li, Xuzhao Li, Shiyu Hu, Kaiqi Huang
Long-form video reasoning remains a major challenge for Video Large Language Models (Video LLMs), as static uniform frame sampling leads to information dilution and obscures critical evidence. Furthermore, existing pixel-space video reasoning agents, which are designed to actively interact with the video to acquire new visual information, remain suboptimal due to their lack of rigorous reward mechanisms to enforce evidence purity and their inability to perform temporal information supplementation beyond pre-sampled frames. To address this critical gap, we propose a novel evidence-prioritized adaptive framework built upon our core philosophy: “Select Less, Reason More.” Our core contribution is the evidence-aware reinforcement learning (EARL) framework, which transforms the model into an active interrogator of evidence. EARL is precisely engineered to dynamically select the most relevant frames and, crucially, to perform localized re-sampling around the selected key frames to access fine-grained temporal detail. Extensive experiments on five demanding video reasoning benchmarks demonstrate that our EARL-trained model achieves new state-of-the-art among open-source Video LLMs, simultaneously learning an effective and high-purity visual evidence selection policy. Impressively, our 7B model achieves 59.8% on LongVideoBench, 69.0% on MVBench and 64.9% on VideoMME. These results highlight the importance of prioritizing evidence purity and the effectiveness of our framework.
长视频推理对于视频大型语言模型(Video LLMs)来说仍然是一个主要挑战,因为静态均匀帧采样会导致信息稀释并掩盖关键证据。此外,现有的像素空间视频推理代理旨在主动与视频交互以获取新的视觉信息,但由于缺乏严格的奖励机制来执行证据纯净度和无法在预采样帧之外进行临时信息补充,因此表现并不理想。为了弥补这一关键差距,我们提出了一种基于核心哲学“少选多推理”的新型证据优先自适应框架。我们的核心贡献是证据感知强化学习(EARL)框架,它将模型转变为证据的主动询问者。EARL经过精确设计,能够动态选择最相关的帧,并且至关重要地,在所选关键帧周围执行局部重新采样以访问精细的时间细节。在五个要求严格的视频推理基准测试上的广泛实验表明,经过EARL训练的模型在开源Video LLMs中实现了最新技术水平的突破,同时学习了有效且高纯度的视觉证据选择策略。令人印象深刻的是,我们的7B模型在LongVideoBench上达到了59.8%,在MVBench上达到了69.0%,在VideoMME上达到了64.9%。这些结果凸显了优先重视证据纯净度的重要性以及我们框架的有效性。
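下面用一个极简的示意说明"先选关键帧、再在其邻域局部重采样"的流程:给定均匀采样帧的相关性得分,取前 k 帧并在其时间邻域内加密采样。窗口与步长均为演示性假设。

```python
# 示意:在均匀采样帧上挑出最相关的关键帧,再在其邻域内做局部细粒度重采样
# (相关性得分假设由模型给出;时间窗与步长均为演示用的假设值)

def refine_sampling(timestamps: list[float], scores: list[float],
                    k: int = 2, window: float = 2.0, step: float = 0.5) -> list[float]:
    ranked = sorted(zip(scores, timestamps), reverse=True)[:k]   # 选择最相关的 k 帧
    refined = set()
    for _, t in ranked:
        u = t - window
        while u <= t + window:                                   # 关键帧邻域内的局部重采样
            refined.add(round(max(u, 0.0), 2))
            u += step
    return sorted(refined)

coarse = [0.0, 10.0, 20.0, 30.0]          # 均匀粗采样得到的时间戳(秒)
print(refine_sampling(coarse, scores=[0.1, 0.8, 0.3, 0.9]))
```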
论文及项目相关链接
PDF Preprint, Under review
Summary
本文介绍了针对视频大型语言模型(Video LLMs)在处理长视频推理时面临的挑战,提出一种新型的证据优先自适应框架,通过证据感知强化学习(EARL)实现模型对证据的主动询问。该框架能够动态选择最相关的帧,并在选定的关键帧周围进行局部重新采样,以获取精细的时间细节。实验结果表明,该框架在五个视频推理基准测试上达到了最新水平,同时学习了有效的高纯度视觉证据选择策略。
Key Takeaways
- 长视频对于Video LLMs是一个挑战,因为静态均匀帧采样会导致信息稀释,掩盖关键证据。
- 现有像素空间视频推理代理虽然能够主动获取新视觉信息,但由于缺乏严格的奖励机制和超出预采样帧的临时信息补充能力,表现不佳。
- 提出了新型的证据优先自适应框架,核心哲学是"少选多推理"。
- EARL框架能将模型转变为对证据的主动询问者,动态选择最相关的帧,并在关键帧周围进行局部重新采样。
- EARL框架能够在五个视频推理基准测试上实现最新水平的表现。
- 实验结果表明,该框架能够学习有效的高纯度视觉证据选择策略。
点此查看论文截图
MARS: Reinforcing Multi-Agent Reasoning of LLMs through Self-Play in Strategic Games
Authors:Huining Yuan, Zelai Xu, Zheyue Tan, Xiangmin Yi, Mo Guang, Kaiwen Long, Haojia Hui, Boxun Li, Xinlei Chen, Bo Zhao, Xiao-Ping Zhang, Chao Yu, Yu Wang
Developing Large Language Models (LLMs) to cooperate and compete effectively within multi-agent systems is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce MARS, an end-to-end RL framework that incentivizes Multi-Agent Reasoning of LLMs through Self-play in both cooperative and competitive games. MARS features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent-specific advantage normalization to stabilize multi-agent training. By learning with self-play across cooperative and competitive games, the MARS agent trained from Qwen3-4B develops strong strategic abilities that generalize to held-out games with up to 28.7% performance improvements. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains of multi-agent systems in reasoning benchmarks. When integrated into leading multi-agent systems, our MARS agent achieves significant performance gains of 10.0% on AIME and 12.5% on GPQA-Diamond. These results establish end-to-end RL training with self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs. Our code and models are publicly available at https://github.com/thu-nics/MARS.
开发大型语言模型(LLM)以在多智能体系统中进行有效合作和竞争,是朝着更高级智能迈进的关键一步。虽然强化学习(RL)在增强单智能体任务的推理能力方面已证明其有效性,但由于长期信用分配和特定智能体优势估计的挑战,其在多轮多智能体场景中的应用仍然被探索得不够充分。为了解决这些挑战,我们引入了MARS,这是一个端到端的RL框架,通过在合作与竞争游戏中的自我博弈,激励大型语言模型的多智能体推理能力。MARS采用轮级优势估计器,将学习信号与每一次交互对齐以进行信用分配,并采用特定智能体的优势归一化来稳定多智能体训练。通过在合作和竞争游戏中以自我博弈方式学习,基于Qwen3-4B训练的MARS智能体发展出强大的战略能力,在留出(未参与训练)的游戏中实现了高达28.7%的性能提升。更重要的是,通过自我博弈获得的能力超越了游戏本身,在多智能体系统的推理基准测试中带来持续的性能提升。当集成到领先的多智能体系统中时,我们的MARS智能体在AIME上实现了10.0%的性能提升,在GPQA-Diamond上实现了12.5%的性能提升。这些结果证明,在战略游戏中使用自我博弈进行端到端RL训练,是培养大型语言模型可泛化的多智能体推理能力的一种强大方法。我们的代码和模型可在https://github.com/thu-nics/MARS公开访问。
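下面给出"回合级优势估计 + 按智能体归一化"的一个简化示意:先为每个回合计算自该回合起的折扣回报,再在每个智能体自己的回合集合内做标准化。折扣因子与基线选择均为演示性假设,并非 MARS 的完整实现。

```python
import numpy as np
from collections import defaultdict

# 示意:回合级优势估计 + 按智能体归一化(简化演示,折扣与基线均为假设)

def turn_level_advantages(trajectory, gamma: float = 1.0):
    """trajectory: [(agent_id, reward), ...] 按回合顺序排列。"""
    # 1) 回合级回报:从当前回合起的折扣奖励和,使学习信号对齐到每一次交互
    returns, g = [], 0.0
    for _, r in reversed(trajectory):
        g = r + gamma * g
        returns.append(g)
    returns = list(reversed(returns))
    # 2) 按智能体归一化优势,稳定多智能体训练
    by_agent = defaultdict(list)
    for (aid, _), ret in zip(trajectory, returns):
        by_agent[aid].append(ret)
    adv = []
    for (aid, _), ret in zip(trajectory, returns):
        g_list = np.asarray(by_agent[aid])
        adv.append((ret - g_list.mean()) / (g_list.std() + 1e-8))
    return adv

traj = [("A", 0.0), ("B", 0.0), ("A", 1.0), ("B", -1.0)]
print([round(a, 3) for a in turn_level_advantages(traj)])
```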
论文及项目相关链接
Summary
该文本介绍了使用端到端的强化学习框架MARS,通过自我博弈在合作和竞争游戏中训练大型语言模型(LLM)的多智能体推理能力的方法。MARS解决了长期信用分配和智能体优势估计的挑战,通过自我博弈提升LLM的战略能力,并在合作和竞争游戏中实现了性能提升。MARS的代码和模型已公开在GitHub上。
Key Takeaways
- 大型语言模型(LLM)在多智能体系统中的合作与竞争对于实现更高级别的智能至关重要。
- 强化学习(RL)在单智能体任务中增强了推理能力,但在多轮多智能体场景中的扩展仍然面临挑战。
- MARS是一个端到端的强化学习框架,通过自我博弈激励LLM的多智能体推理能力。
- MARS解决了长期信用分配和智能体特定优势估计的挑战。
- MARS通过在合作和竞争游戏中的自我博弈,提高了LLM的战略能力,使其在留出(未参与训练)的游戏中获得高达28.7%的性能提升。
- MARS代理的能力可以推广到游戏之外,在多智能体系统的推理基准测试中实现了持续的性能提升。
点此查看论文截图
VERITAS: Leveraging Vision Priors and Expert Fusion to Improve Multimodal Data
Authors:Tingqiao Xu, Ziru Zeng, Jiayu Chen
The quality of supervised fine-tuning (SFT) data is crucial for the performance of large multimodal models (LMMs), yet current data enhancement methods often suffer from factual errors and hallucinations due to inadequate visual perception. To address this challenge, we propose VERITAS, a pipeline that systematically integrates vision priors and multiple state-of-the-art LMMs with statistical methods to enhance SFT data quality. VERITAS leverages visual recognition models (RAM++) and OCR systems (PP-OCRv4) to extract structured vision priors, which are combined with images, questions, and answers. Three LMMs (GPT-4o, Gemini-2.5-Pro, Doubao-1.5-pro) evaluate the original answers, providing critique rationales and scores that are statistically fused into a high-confidence consensus score serving as ground truth. Using this consensus, we train a lightweight critic model via Group Relative Policy Optimization (GRPO), enhancing reasoning capabilities efficiently. Each LMM then refines the original answers based on the critiques, generating new candidate answers; we select the highest-scoring one as the final refined answer. Experiments across six multimodal benchmarks demonstrate that models fine-tuned with data processed by VERITAS consistently outperform those using raw data, particularly in text-rich and fine-grained reasoning tasks. Our critic model exhibits enhanced capability comparable to state-of-the-art LMMs while being significantly more efficient. We release our pipeline, datasets, and model checkpoints to advance research in multimodal data optimization.
监督微调(SFT)数据的质量对于大型多模态模型(LMMs)的性能至关重要。然而,当前的数据增强方法常常因为视觉感知不足而遭受事实错误和幻觉的问题。为了应对这一挑战,我们提出了VERITAS,这是一个系统地将视觉先验知识和多种最新LMMs与统计方法相结合的管道,以提高SFT数据质量。VERITAS利用视觉识别模型(RAM++)和OCR系统(PP-OCRv4)提取结构化视觉先验知识,并将其与图像、问题、答案相结合。三个LMMs(GPT-4o、Gemini-2.5-Pro、Doubao-1.5-pro)对原始答案进行评估,提供批判理由和分数,这些分数通过统计融合形成高置信度共识分数,作为真实依据。使用这个共识,我们通过群体相对策略优化(GRPO)训练了一个轻量级的批判模型,有效地提高了推理能力。然后,每个LMM根据批评对原始答案进行精炼,生成新的候选答案;我们选择得分最高的一个作为最终的精炼答案。在六个多模态基准测试上的实验表明,经过VERITAS处理的数据进行微调后的模型始终优于使用原始数据的模型,特别是在文本丰富和精细推理任务中表现更优秀。我们的批判模型展现了增强的能力,可与最新的LMMs相媲美,同时效率更高。我们发布我们的管道、数据集和模型检查点,以促进多模态数据优化领域的研究发展。
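下面用一个小例子演示"将多个评审模型的打分统计融合为共识分数"的一种可能做法:对每个评审的分数做 z-score 标准化后取平均,以消除打分尺度差异。具体融合方式仅为假设,并非论文原始流程。

```python
import numpy as np

# 示意:把多个评审 LMM 的打分统计融合为共识分数(逐评审 z-score 标准化后取均值,仅为一种可能做法)

def consensus_scores(scores_by_judge: dict) -> np.ndarray:
    zs = []
    for judge, s in scores_by_judge.items():
        s = np.asarray(s, dtype=float)
        zs.append((s - s.mean()) / (s.std() + 1e-8))   # 消除各评审打分尺度的差异
    return np.mean(zs, axis=0)                          # 融合为共识分数(可再映射回原量表)

scores = {
    "GPT-4o":         [4, 3, 5, 2],
    "Gemini-2.5-Pro": [5, 3, 4, 2],
    "Doubao-1.5-pro": [4, 2, 5, 3],
}
print(consensus_scores(scores).round(2))
```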
论文及项目相关链接
PDF Accepted to EMNLP 2025 (Main Conference)
Summary
文本提出了VERITAS流程,通过集成视觉先验知识和多种先进的LMM模型与统计方法,提高监督微调数据的质量。VERITAS利用视觉识别模型和OCR系统提取结构化视觉先验知识,并与图像、问题和答案结合。通过三个LMM模型对原始答案进行评估,提供批判理由和分数,统计融合形成高置信度共识分数作为真实标准。使用此共识训练轻量级批判模型,通过集团相对策略优化(GRPO)提高推理能力。该流程在六个多模式基准测试上的实验表明,经过VERITAS处理的数据微调后的模型性能优于使用原始数据的模型,特别是在文本丰富和精细推理任务中。
Key Takeaways
- VERITAS流程通过整合视觉先验知识和多种先进的LMM模型来提高监督微调数据的质量。
- VERITAS利用视觉识别模型和OCR系统提取结构化视觉先验信息。
- 三个LMM模型对原始答案进行评估,提供批判分数,形成共识作为真实标准。
- 使用共识训练轻量级批判模型,通过GRPO提高推理能力。
- VERITAS处理的数据微调后的模型性能优于使用原始数据的模型。
- VERITAS在六个多模式基准测试上的实验表现优异,特别是在文本丰富和精细推理任务中。
点此查看论文截图
Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry
Authors:Bolei Ma, Yina Yao, Anna-Carolina Haensch
Large Language Models (LLMs) are increasingly applied to creative domains, yet their performance in classical Chinese poetry generation and evaluation remains poorly understood. We propose a three-step evaluation framework that combines computational metrics, LLM-as-a-judge assessment, and human expert validation. Using this framework, we evaluate six state-of-the-art LLMs across multiple dimensions of poetic quality, including themes, emotions, imagery, form, and style. Our analysis reveals systematic generation and evaluation biases: LLMs exhibit “echo chamber” effects when assessing creative quality, often converging on flawed standards that diverge from human judgments. These findings highlight both the potential and limitations of current capabilities of LLMs as proxy for literacy generation and the limited evaluation practices, thereby demonstrating the continued need of hybrid validation from both humans and models in culturally and technically complex creative tasks.
大型语言模型(LLM)越来越多地被应用于创意领域,然而它们在古典中文诗歌生成与评估方面的表现仍缺乏深入了解。我们提出了一个三步评估框架,结合了计算指标、LLM作为评委的评估以及人类专家验证。使用这个框架,我们评估了六款最先进的大型语言模型在诗歌质量多个维度上的表现,包括主题、情感、意象、形式和风格。我们的分析揭示了系统性的生成与评估偏见:在评估创造性质量时,大型语言模型表现出"回音室"效应,往往收敛到有缺陷的标准上,而这些标准与人类的判断相悖。这些发现既突显了当前大型语言模型作为文学创作代理的潜力,也揭示了其局限性以及现有评估实践的不足,从而表明在文化与技术上都很复杂的创造性任务中,仍然需要人类与模型相结合的混合验证。
论文及项目相关链接
Summary:大型语言模型(LLMs)在诗歌生成和评价方面的表现尚待了解。本文提出了一个三阶段的评价框架,结合计算指标、LLM作为评判者和人类专家验证的方法,评估了六种先进的LLMs在诗歌质量上的表现。分析显示,LLMs在评估创造性质量时存在系统性偏见,与人类的判断存在分歧。这凸显了LLMs作为文学代理生成和有限评估实践的潜力和局限性,并展示了在文化和技术上复杂的创造性任务中,人类和模型混合验证的持续性需求。
Key Takeaways:
- 大型语言模型(LLMs)在古典诗词生成和评价领域应用逐渐增多,但其表现尚待了解。
- 提出一个三阶段的评价框架,包括计算指标、LLM作为评判者和人类专家验证。
- 评估了六种先进的LLMs在诗歌质量多个维度上的表现。
- LLMs在评估诗歌创造性质量时存在系统性偏见,即“回声室效应”。
- LLMs在评估诗歌质量时,其标准常存在缺陷,与人类判断存在分歧。
- LLMs作为文学代理生成和评估工具存在潜力和局限性。
点此查看论文截图