LLM

Published: 2025-09-16

Updated: 2025-10-07

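⚠️ All summaries below were generated by a large language model and may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use these summaries for serious academic work; they are suitable only for screening papers before reading them.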

2025-09-16 Update

MatSKRAFT: A framework for large-scale materials knowledge extraction from scientific tables

Authors:Kausik Hira, Mohd Zaki, Mausam, N. M. Anoop Krishnan

Scientific progress increasingly depends on synthesizing knowledge across vast literature, yet most experimental data remains trapped in semi-structured formats that resist systematic extraction and analysis. Here, we present MatSKRAFT, a computational framework that automatically extracts and integrates materials science knowledge from tabular data at unprecedented scale. Our approach transforms tables into graph-based representations processed by constraint-driven GNNs that encode scientific principles directly into model architecture. MatSKRAFT significantly outperforms state-of-the-art large language models, achieving F1 scores of 88.68 for property extraction and 71.35 for composition extraction, while processing data $19$-$496\times$ faster than them (compared to the slowest and the fastest models, respectively) with modest hardware requirements. Applied to nearly 69,000 tables from more than 47,000 research publications, we construct a comprehensive database containing over 535,000 entries, including 104,000 compositions that expand coverage beyond major existing databases, pending manual validation. This systematic approach reveals previously overlooked materials with distinct property combinations and enables data-driven discovery of composition-property relationships forming the cornerstone of materials and scientific discovery.
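As a rough illustration of the table-to-graph idea in the abstract above, the sketch below turns a small table into cell nodes with row and column edges, and adds one example of a domain constraint. It is not the authors' implementation: the function names, the edge rule, the sample table, and the "percentages sum to about 100" check are all assumptions made for illustration.

```python
from itertools import combinations

def table_to_graph(table):
    """Turn a list-of-rows table into (nodes, edges).

    Nodes are cells keyed by (row, col); edges link cells sharing a row or a column.
    """
    nodes = {(r, c): cell for r, row in enumerate(table) for c, cell in enumerate(row)}
    edges = set()
    for a, b in combinations(nodes, 2):
        if a[0] == b[0] or a[1] == b[1]:
            edges.add((a, b))
    return nodes, edges

def composition_is_valid(fractions, total=100.0, tol=0.5):
    """Hypothetical domain constraint: component percentages should sum to ~100."""
    return abs(sum(fractions) - total) <= tol

# Made-up example table: one header row and one glass composition row.
table = [["Glass", "SiO2", "Na2O", "Density"],
         ["G1", "70", "30", "2.45"]]
nodes, edges = table_to_graph(table)
```

A graph neural network would then pass messages along these edges, so a value cell can see both its column header and the other cells in its row.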

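Paper and project links

PDF

Summary

MatSKRAFT is a computational framework that automatically extracts and integrates materials science knowledge from tabular data at large scale. It converts tables into graph-based representations, processes them with constraint-driven graph neural networks (GNNs), and encodes scientific principles directly into the model architecture. Compared with state-of-the-art large language models, MatSKRAFT achieves higher F1 scores on property extraction (88.68) and composition extraction (71.35), runs 19 to 496 times faster, and has modest hardware requirements. Applied to nearly 69,000 tables from more than 47,000 publications, it yields a database of over 535,000 entries, reveals previously overlooked materials with distinct property combinations, and supports data-driven discovery of composition-property relationships.

Key Takeaways

MatSKRAFT automatically extracts and integrates materials science knowledge from tabular data.
The framework converts tables into graphs and uses constraint-driven GNNs to build scientific principles into the model.
MatSKRAFT outperforms state-of-the-art LLMs on both property extraction and composition extraction.
MatSKRAFT runs 19 to 496 times faster than those LLMs with modest hardware requirements.
Processing nearly 69,000 tables produced a database of over 535,000 entries.
The study reveals previously overlooked materials with distinct property combinations.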

DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL

Authors:Rui Lu, Zhenyu Hou, Zihan Wang, Hanchen Zhang, Xiao Liu, Yujiang Li, Shi Feng, Jie Tang, Yuxiao Dong

Augmenting large language models (LLMs) with browsing tools substantially improves their potential as deep search agents to solve complex, real-world tasks. Yet, open LLMs still perform poorly in such settings due to limited long-horizon reasoning capacity with browsing tools and the lack of sufficiently difficult supervised data. To address these challenges, we present DeepDive to advance deep search agents. First, we propose a strategy to automatically synthesize complex, difficult, and hard-to-find questions from open knowledge graphs. Second, we apply end-to-end multi-turn reinforcement learning (RL) to enhance LLMs’ long-horizon reasoning with deep search. Experiments show that DeepDive-32B achieves a new open-source competitive result on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and Search-o1. We demonstrate that multi-turn RL training improves deep search ability and significantly contributes to the performance improvements across multiple benchmarks. We observe that DeepDive enables test-time scaling of tool calls and parallel sampling. All datasets, models, and code are publicly available at https://github.com/THUDM/DeepDive.
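The abstract says questions are synthesized from open knowledge graphs but does not give the procedure. One plausible reading is a multi-hop random walk whose intermediate entities are hidden from the question; the sketch below shows that idea on a toy graph. The graph, the function names, and the question template are hypothetical and are not taken from the DeepDive code.

```python
import random

# Toy knowledge graph: entity -> list of (relation, entity). All names are made up.
KG = {
    "A": [("born_in", "B")],
    "B": [("capital_of", "C")],
    "C": [("member_of", "D")],
    "D": [],
}

def random_walk(kg, start, max_hops, rng):
    """Follow random outgoing edges and return the (head, relation, tail) hops."""
    path, node = [], start
    for _ in range(max_hops):
        if not kg[node]:
            break
        rel, nxt = rng.choice(kg[node])
        path.append((node, rel, nxt))
        node = nxt
    return path

def path_to_question(path):
    """Name only the start entity; the final entity becomes the answer."""
    relations = " -> ".join(rel for _, rel, _ in path)
    question = f"Starting from {path[0][0]}, follow [{relations}]: which entity do you reach?"
    return question, path[-1][2]

path = random_walk(KG, "A", max_hops=3, rng=random.Random(0))
question, answer = path_to_question(path)
```

Longer walks and vaguer descriptions of each hop would make the questions harder to answer with a single search.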

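Paper and project links

PDF

Summary

Augmenting large language models (LLMs) with browsing tools substantially improves their potential as deep search agents for complex real-world tasks. Open LLMs still perform poorly in this setting because their long-horizon reasoning with browsing tools is limited and sufficiently difficult supervised data is scarce. DeepDive addresses both problems. First, it automatically synthesizes complex, difficult, and hard-to-find questions from open knowledge graphs. Second, it applies end-to-end multi-turn reinforcement learning (RL) to strengthen long-horizon reasoning with deep search. DeepDive-32B achieves a new competitive open-source result on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and Search-o1. Multi-turn RL training improves deep search ability and contributes significantly to gains across multiple benchmarks, and DeepDive supports test-time scaling of tool calls and parallel sampling. All datasets, models, and code are publicly available at https://github.com/THUDM/DeepDive.

Key Takeaways

Augmenting LLMs with browsing tools substantially improves their ability to solve complex real-world tasks.
Open LLMs are held back in deep search by limited long-horizon reasoning with browsing tools and by a lack of sufficiently difficult supervised data.
DeepDive improves LLMs by automatically synthesizing hard questions from open knowledge graphs and by training with end-to-end multi-turn RL.
DeepDive-32B outperforms WebSailor, DeepSeek-R1-Browse, and Search-o1 on BrowseComp.
Multi-turn RL training contributes significantly to deep search ability and benchmark performance.
DeepDive supports test-time scaling of tool calls and parallel sampling.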

RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment

Authors:Shadikur Rahman, Aroosa Hameed, Gautam Srivastava, Syed Muhammad Danish

To optimize the reasoning and problem-solving capabilities of Large Language Models (LLMs), we propose a novel cloud-edge collaborative architecture that enables a structured, multi-agent prompting framework. This framework comprises three specialized components: GuideLLM, a lightweight model deployed at the edge to provide methodological guidance; SolverLLM, a more powerful model hosted in the cloud responsible for generating code solutions; and JudgeLLM, an automated evaluator for assessing solution correctness and quality. To evaluate and demonstrate the effectiveness of this architecture in realistic settings, we introduce RefactorCoderQA, a comprehensive benchmark designed to evaluate and enhance the performance of Large Language Models (LLMs) across multi-domain coding tasks. Motivated by the limitations of existing benchmarks, RefactorCoderQA systematically covers various technical domains, including Software Engineering, Data Science, Machine Learning, and Natural Language Processing, using authentic coding challenges from Stack Overflow. Extensive experiments reveal that our fine-tuned model, RefactorCoder-MoE, achieves state-of-the-art performance, significantly outperforming leading open-source and commercial baselines with an overall accuracy of 76.84%. Human evaluations further validate the interpretability, accuracy, and practical relevance of the generated solutions. In addition, we evaluate system-level metrics, such as throughput and latency, to gain deeper insights into the performance characteristics and trade-offs of the proposed architecture.
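The guide, solver, and judge roles described above can be pictured as a three-stage pipeline. The sketch below wires the stages together with stub functions so that it runs without any model access. The `LLM` callable type, the prompt wording, and the returned dictionary are assumptions for illustration, not the paper's interface.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out

def solve(question: str, guide: LLM, solver: LLM, judge: LLM) -> dict:
    """Run guide -> solver -> judge once and collect each stage's output."""
    guidance = guide(f"Outline an approach for: {question}")
    solution = solver(f"Question: {question}\nGuidance: {guidance}\nWrite the code.")
    verdict = judge(f"Question: {question}\nSolution: {solution}\nIs it correct?")
    return {"guidance": guidance, "solution": solution, "verdict": verdict}

# Stub models so the sketch runs without any API access.
result = solve(
    "Reverse a list in Python",
    guide=lambda prompt: "Use slicing.",
    solver=lambda prompt: "def rev(xs): return xs[::-1]",
    judge=lambda prompt: "correct",
)
```

In the architecture described in the abstract, the guide would run on an edge device and the solver in the cloud; here all three are local stubs.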

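Paper and project links

PDF 12 pages, 5 figures, submitted to IEEE Transactions on Services Computing

Summary

The paper proposes a cloud-edge collaborative architecture for improving the reasoning and problem-solving capabilities of large language models (LLMs). The architecture is a structured multi-agent prompting framework with three specialized components: GuideLLM, a lightweight model deployed at the edge that provides methodological guidance; SolverLLM, a more powerful cloud-hosted model that generates code solutions; and JudgeLLM, an automated evaluator of solution correctness and quality. To evaluate the architecture in realistic settings, the authors introduce RefactorCoderQA, a benchmark for multi-domain coding tasks that covers software engineering, data science, machine learning, and natural language processing and is built from authentic Stack Overflow coding challenges. The fine-tuned model RefactorCoder-MoE achieves state-of-the-art performance, significantly outperforming leading open-source and commercial baselines with an overall accuracy of 76.84%. Human evaluations further validate the interpretability, accuracy, and practical relevance of the generated solutions, and system-level metrics such as throughput and latency characterize the architecture's performance trade-offs.

Key Takeaways

The paper proposes a cloud-edge collaborative architecture to improve LLM reasoning and problem solving.
The architecture has three specialized components, GuideLLM, SolverLLM, and JudgeLLM, which provide methodological guidance, generate code solutions, and evaluate solution quality, respectively.
RefactorCoderQA is introduced as a benchmark for LLM performance on multi-domain coding tasks.
The fine-tuned RefactorCoder-MoE model reaches state-of-the-art performance with an overall accuracy of 76.84%.
Human evaluations validate the interpretability, accuracy, and practical relevance of the generated solutions.
System-level metrics such as throughput and latency are measured to understand the architecture's performance characteristics and trade-offs.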

Inpainting-Guided Policy Optimization for Diffusion Large Language Models

Authors:Siyan Zhao, Mengchen Liu, Jing Huang, Miao Liu, Chenyu Wang, Bo Liu, Yuandong Tian, Guan Pang, Sean Bell, Aditya Grover, Feiyu Chen

Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with reinforcement learning faces an exploration challenge: sparse reward signals and sample waste when models fail to discover correct solutions. While this inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity: their inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided Policy Optimization), an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning. We apply IGPO to group-based optimization methods such as GRPO, where exploration failures cause zero advantages and gradients. IGPO restores meaningful gradients while improving sample efficiency. We also propose supervised fine-tuning on synthetically rewritten concise traces that better align with dLLM generation patterns. With additional techniques including entropy-based filtering, our training recipe yields substantial gains across three mathematical benchmarks (GSM8K, Math500, and AMC), achieving new state-of-the-art results for full-attention masked dLLMs.
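The "zero advantages and gradients" failure mode is easy to see numerically. The sketch below computes group-normalized advantages in the style commonly described for GRPO and shows that a group in which every sample fails gets all-zero advantages, while a group with one success does not. The `inpaint_prompt` helper is a hypothetical stand-in for hint injection: it reveals a prefix of the reference trace and masks the rest, which is a simplification, since a masked diffusion model could fill hints at arbitrary positions.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def inpaint_prompt(prompt_tokens, truth_tokens, hint_fraction):
    """Hypothetical hint injection: reveal a prefix of the reference trace, mask the rest."""
    n_hint = int(len(truth_tokens) * hint_fraction)
    masked = ["[MASK]"] * (len(truth_tokens) - n_hint)
    return prompt_tokens + truth_tokens[:n_hint] + masked

all_failed = group_advantages([0.0, 0.0, 0.0, 0.0])  # every sample wrong: no signal
mixed = group_advantages([1.0, 0.0, 0.0, 0.0])       # one sample succeeds: signal returns
seeded = inpaint_prompt(["Q:"], ["a", "b", "c", "d"], hint_fraction=0.5)
```

Once a hinted sample earns a nonzero reward, the group is no longer uniform and the advantages carry a learning signal again.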

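Paper and project links

PDF preprint; 21 pages

Summary

Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs and support unique generation capabilities such as inpainting. The paper explores how inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with reinforcement learning faces an exploration challenge: reward signals are sparse, and samples are wasted when the model fails to discover correct solutions. The inpainting ability of dLLMs offers a way to guide exploration. IGPO (Inpainting Guided Policy Optimization) is an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning.

Key Takeaways

Masked diffusion large language models (dLLMs) are promising alternatives to autoregressive LLMs and support unique generation capabilities such as inpainting.
Aligning LLMs with RL faces an exploration challenge: sparse reward signals and wasted samples.
The inpainting ability of dLLMs can be used to guide exploration.
IGPO guides exploration by inserting partial ground-truth reasoning traces during online sampling.
Applied to group-based optimization methods such as GRPO, IGPO restores meaningful gradients and improves sample efficiency.
Supervised fine-tuning on synthetically rewritten concise traces that better match dLLM generation patterns, combined with techniques such as entropy-based filtering, yields substantial gains on GSM8K, Math500, and AMC.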

MCBP: A Memory-Compute Efficient LLM Inference Accelerator Leveraging Bit-Slice-enabled Sparsity and Repetitiveness

Authors:Huizheng Wang, Zichuan Wang, Zhiheng Yue, Yousheng Long, Taiquan Wei, Jianxun Yang, Yang Wang, Chao Li, Shaojun Wei, Yang Hu, Shouyi Yin

Large language models (LLMs) face significant inference latency due to inefficiencies in GEMM operations, weight access, and KV cache access, especially in real-time scenarios. This highlights the need for a versatile compute-memory efficient accelerator. Unfortunately, existing Transformer accelerators struggle to address both aspects simultaneously, as they focus on value-level processing, missing fine-grained opportunities to optimize computation and memory collaboratively. This paper introduces MCBP, a bit-grained compute-memory efficient algorithm-hardware co-design that leverages bit-slice (BS) enabled repetitiveness and sparsity to accelerate LLM inference. MCBP features three key innovations: 1) BS-repetitiveness-enabled computation reduction (BRCR), which eliminates redundant GEMM computations via leveraging redundancy hidden among BS vectors; 2) BS-sparsity-enabled two-state coding (BSTC), which reduces weight access via exploiting significant sparsity in high-order bit-slice weight; 3) Bit-grained progressive prediction (BGPP), which reduces KV cache access by leveraging early-termination-based bit-grained prediction. These techniques, supported by custom accelerator designs, effectively alleviate the burden in GEMM, weight access, and KV cache access. Extensive experiments on 26 benchmarks show that MCBP achieves 9.43x speed up and 31.1x higher energy efficiency than Nvidia A100 GPU. Compared to SOTA Transformer accelerators, MCBP achieves 35x, 5.2x and 3.2x energy saving than Spatten, FACT and SOFA, respectively.
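To make the bit-slice idea concrete, the sketch below splits small integer weights into per-bit 0/1 vectors, shows that the high-order slices are entirely zero, and recomputes a dot product from the slices. It assumes unsigned 8-bit weights and made-up values; it illustrates only the arithmetic and not MCBP's coding schemes or hardware design.

```python
def bit_slices(weights, bits=8):
    """Split unsigned integers into per-bit 0/1 vectors, least significant bit first."""
    return [[(w >> b) & 1 for w in weights] for b in range(bits)]

def dot_from_slices(slices, activations):
    """Recompute sum(w * x) as the sum over bits of 2^b * (slice_b . x)."""
    return sum(
        (1 << b) * sum(s * x for s, x in zip(bit_slice, activations))
        for b, bit_slice in enumerate(slices)
    )

weights = [3, 5, 2, 9]  # small magnitudes leave the high-order bits empty
acts = [1, 2, 3, 4]
slices = bit_slices(weights)
zero_fraction = [bit_slice.count(0) / len(bit_slice) for bit_slice in slices]
```

Slices that are all zero contribute nothing to the sum, which is the kind of sparsity a bit-level design can skip; repeated slice patterns similarly allow partial sums to be reused.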

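Paper and project links

PDF

Summary

LLM inference suffers significant latency because of inefficiencies in GEMM operations, weight access, and KV cache access. Existing Transformer accelerators struggle to improve compute and memory efficiency together because they operate at the value level and miss fine-grained opportunities to optimize the two jointly. MCBP is a bit-grained algorithm-hardware co-design that exploits bit-slice (BS) repetitiveness and sparsity to accelerate LLM inference. It has three key innovations: BS-repetitiveness-enabled computation reduction (BRCR), BS-sparsity-enabled two-state coding (BSTC), and bit-grained progressive prediction (BGPP). Across 26 benchmarks it achieves a 9.43x speedup and 31.1x higher energy efficiency than an Nvidia A100 GPU.

Key Takeaways

LLM inference latency stems mainly from inefficient GEMM operations, weight access, and KV cache access.
Existing Transformer accelerators struggle to optimize compute and memory efficiency at the same time.
MCBP is a bit-grained accelerator design that exploits bit-slice repetitiveness and sparsity to speed up LLM inference.
MCBP has three key innovations: computation reduction (BRCR), two-state coding (BSTC), and progressive prediction (BGPP).
Across 26 benchmarks, MCBP achieves a 9.43x speedup and 31.1x higher energy efficiency than an Nvidia A100 GPU.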

Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective

Authors:Seokjin Go, Joongun Park, Spandan More, Hanjiang Wu, Irene Wang, Aaron Jezghani, Tushar Krishna, Divya Mahajan

The rapid scaling of Large Language Models (LLMs) has pushed training workloads far beyond the limits of single-node analysis, demanding a deeper understanding of how these models behave across large-scale, multi-GPU systems. In this paper, we present a comprehensive characterization of LLM training across diverse real-world workloads and hardware platforms, including NVIDIA H100/H200 and AMD MI250 GPUs. We analyze dense and sparse models under various parallelism strategies – tensor, pipeline, data, and expert – and evaluate their effects on hardware utilization, power consumption, and thermal behavior. We further evaluate the effectiveness of optimizations such as activation recomputation and compute-communication overlap. Our findings show that performance is not determined solely by scaling hardware capacity. Scale-up systems with fewer, higher-memory GPUs can outperform scale-out systems in communication-bound regimes, but only under carefully tuned configurations; in other cases, scale-out deployments achieve superior throughput. We also show that certain parallelism combinations, such as tensor with pipeline, lead to bandwidth underutilization due to inefficient data chunking, while increasing microbatch sizes beyond a certain point induces bursty execution and peak power excursions that worsen thermal throttling. These insights reveal how training performance is shaped by complex interactions between hardware, system topology, and model execution. We conclude by offering recommendations for system and hardware design to improve the scalability and reliability of future LLM systems and workloads. The source code of this project is available at https://github.com/sitar-lab/CharLLM-PPT.
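For readers less familiar with how the parallelism strategies combine, the sketch below does the basic bookkeeping: the data-parallel degree left over after tensor and pipeline parallelism, and the idle "bubble" fraction of a simple GPipe-style pipeline schedule. Both formulas are standard textbook approximations rather than results from this paper, expert parallelism is left out, and the GPU counts are made up.

```python
def data_parallel_degree(world_size, tensor, pipeline):
    """Data-parallel replicas left after tensor and pipeline parallelism claim their GPUs."""
    if world_size % (tensor * pipeline):
        raise ValueError("world size must be divisible by tensor * pipeline")
    return world_size // (tensor * pipeline)

def pipeline_bubble_fraction(stages, microbatches):
    """Idle fraction of a simple GPipe-style schedule: (p - 1) / (m + p - 1)."""
    return (stages - 1) / (microbatches + stages - 1)

# Hypothetical 64-GPU job split 8-way tensor and 4-way pipeline.
dp = data_parallel_degree(world_size=64, tensor=8, pipeline=4)
few = pipeline_bubble_fraction(stages=4, microbatches=4)
many = pipeline_bubble_fraction(stages=4, microbatches=32)
```

This model only counts idle pipeline slots as the number of microbatches changes. It says nothing about the power and thermal effects of larger microbatch sizes that the paper measures.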

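Paper and project links

PDF

Summary

Training workloads for large language models (LLMs) have outgrown single-node analysis and require a deeper understanding of how these models behave on large multi-GPU systems. The paper characterizes LLM training across diverse real-world workloads and hardware platforms, including NVIDIA H100/H200 and AMD MI250 GPUs. It analyzes dense and sparse models under tensor, pipeline, data, and expert parallelism and evaluates the effects on hardware utilization, power consumption, and thermal behavior. It also evaluates optimizations such as activation recomputation and compute-communication overlap. The findings show that performance is not determined solely by scaling hardware capacity; it is shaped by complex interactions between hardware, system topology, and model execution. Certain parallelism combinations, such as tensor with pipeline, underutilize bandwidth, and increasing microbatch size beyond a certain point causes bursty execution and peak power excursions that worsen thermal throttling. The paper closes with system and hardware design recommendations for improving the scalability and reliability of future LLM systems and workloads.

Key Takeaways

LLM training has outgrown single-node analysis and requires understanding behavior across large multi-GPU systems.
The study analyzes dense and sparse models under tensor, pipeline, data, and expert parallelism.
It evaluates the effects on hardware utilization, power consumption, and thermal behavior.
Performance is not determined solely by hardware capacity; it depends on interactions between hardware, system topology, and model execution.
Some parallelism combinations underutilize bandwidth, and overly large microbatches worsen thermal throttling.
The effectiveness of optimizations such as activation recomputation and compute-communication overlap is evaluated.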

The Morality of Probability: How Implicit Moral Biases in LLMs May Shape the Future of Human-AI Symbiosis

Authors:Eoin O’Doherty, Nicole Weinrauch, Andrew Talone, Uri Klempner, Xiaoyuan Yi, Xing Xie, Yi Zeng

Artificial intelligence (AI) is advancing at a pace that raises urgent questions about how to align machine decision-making with human moral values. This working paper investigates how leading AI systems prioritize moral outcomes and what this reveals about the prospects for human-AI symbiosis. We address two central questions: (1) What moral values do state-of-the-art large language models (LLMs) implicitly favour when confronted with dilemmas? (2) How do differences in model architecture, cultural origin, and explainability affect these moral preferences? To explore these questions, we conduct a quantitative experiment with six LLMs, ranking and scoring outcomes across 18 dilemmas representing five moral frameworks. Our findings uncover strikingly consistent value biases. Across all models, Care and Virtue values outcomes were rated most moral, while libertarian choices were consistently penalized. Reasoning-enabled models exhibited greater sensitivity to context and provided richer explanations, whereas non-reasoning models produced more uniform but opaque judgments. This research makes three contributions: (i) Empirically, it delivers a large-scale comparison of moral reasoning across culturally distinct LLMs; (ii) Theoretically, it links probabilistic model behaviour with underlying value encodings; (iii) Practically, it highlights the need for explainability and cultural awareness as critical design principles to guide AI toward a transparent, aligned, and symbiotic future.

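Paper and project links

PDF Work in progress

Summary

This working paper examines how to align AI decision-making with human moral values. In a quantitative experiment, six leading large language models (LLMs) ranked and scored outcomes across 18 dilemmas representing five moral frameworks. The models show strikingly consistent value biases: Care and Virtue outcomes were rated most moral, while libertarian choices were consistently penalized. Reasoning-enabled models were more sensitive to context and gave richer explanations, whereas non-reasoning models produced more uniform but opaque judgments. The paper makes three contributions: an empirical comparison of moral reasoning across culturally distinct LLMs, a theoretical link between probabilistic model behaviour and underlying value encodings, and a practical case for explainability and cultural awareness as design principles.

Key Takeaways

The pace of AI progress raises urgent questions about aligning machine decision-making with human moral values.
In a quantitative experiment, six leading LLMs show consistent value biases when confronted with moral dilemmas.
Care and Virtue outcomes were rated most moral, while libertarian choices were consistently penalized.
Reasoning-enabled models are more sensitive to context and give richer explanations.
The study examines how differences in model architecture, cultural origin, and explainability affect moral preferences.
The paper provides an empirical, large-scale comparison of moral reasoning across culturally distinct LLMs.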

URL2Graph++: Unified Semantic-Structural-Character Learning for Malicious URL Detection

Authors:Ye Tian, Yifan Jia, Yanbin Wang, Jianguo Sun, Zhiquan Liu, Xiaowen Ling

Malicious URL detection remains a major challenge in cybersecurity, primarily due to two factors: (1) the exponential growth of the Internet has led to an immense diversity of URLs, making generalized detection increasingly difficult; and (2) attackers are increasingly employing sophisticated obfuscation techniques to evade detection. We advocate that addressing these challenges fundamentally requires: (1) obtaining semantic understanding to improve generalization across vast and diverse URL sets, and (2) accurately modeling contextual relationships within the structural composition of URLs. In this paper, we propose a novel malicious URL detection method combining multi-granularity graph learning with semantic embedding to jointly capture semantic, character-level, and structural features for robust URL analysis. To model internal dependencies within URLs, we first construct dual-granularity URL graphs at both subword and character levels, where nodes represent URL tokens/characters and edges encode co-occurrence relationships. To obtain fine-grained embeddings, we initialize node representations using a character-level convolutional network. The two graphs are then processed through jointly trained Graph Convolutional Networks to learn consistent graph-level representations, enabling the model to capture complementary structural features that reflect co-occurrence patterns and character-level dependencies. Furthermore, we employ BERT to derive semantic representations of URLs for semantically aware understanding. Finally, we introduce a gated dynamic fusion network to combine the semantically enriched BERT representations with the jointly optimized graph vectors, further enhancing detection performance. We extensively evaluate our method across multiple challenging dimensions. Results show our method exceeds SOTA performance, including against large language models.
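The graph construction step can be sketched as follows: nodes are the distinct tokens or characters of a URL, and edge weights count how often two of them appear within a sliding window. The window size, the undirected edge convention, and the example URL are assumptions. The paper's actual tokenization, node features, and fusion network are not reproduced here.

```python
from collections import Counter

def cooccurrence_graph(tokens, window=2):
    """Count undirected co-occurrences of distinct tokens within a sliding window."""
    edges = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window]:
            if left != right:
                edges[tuple(sorted((left, right)))] += 1
    return edges

url = "abc.com/ab"  # made-up URL
char_graph = cooccurrence_graph(list(url))       # character-level graph
token_graph = cooccurrence_graph(["abc", ".", "com", "/", "ab"])  # coarser, subword-like graph
```

Running the same function at two granularities gives the pair of graphs that the abstract describes feeding into jointly trained graph networks.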

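Paper and project links

PDF

Summary

The paper proposes a malicious URL detection method that combines multi-granularity graph learning with semantic embedding. Dual-granularity URL graphs at the subword and character levels capture internal dependencies within URLs, with node representations initialized by a character-level convolutional network. Jointly trained Graph Convolutional Networks then learn consistent graph-level representations, while BERT provides semantic representations of URLs. Finally, a gated dynamic fusion network combines the semantically enriched BERT representations with the jointly optimized graph vectors to improve detection performance. Across multiple challenging evaluation dimensions, the method exceeds state-of-the-art performance, including against large language models.

Key Takeaways

Malicious URL detection is hard mainly because of the exponential growth and diversity of URLs and because attackers use increasingly sophisticated obfuscation techniques.
The proposed method combines multi-granularity graph learning with semantic embedding for robust URL analysis.
Dual-granularity URL graphs capture internal dependencies within URLs, with node representations initialized by a character-level convolutional network.
BERT provides semantic representations of URLs for semantically aware understanding.
A gated dynamic fusion network combines the semantic representations with the jointly optimized graph vectors to further improve detection performance.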

SignClip: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion

Authors:Wenfang Wu, Tingting Yuan, Yupeng Li, Daling Wang, Xiaoming Fu

Sign language translation (SLT) aims to translate natural language from sign language videos, serving as a vital bridge for inclusive communication. While recent advances leverage powerful visual backbones and large language models, most approaches mainly focus on manual signals (hand gestures) and tend to overlook non-manual cues like mouthing. In fact, mouthing conveys essential linguistic information in sign languages and plays a crucial role in disambiguating visually similar signs. In this paper, we propose SignClip, a novel framework to improve the accuracy of sign language translation. It fuses manual and non-manual cues, specifically spatial gesture and lip movement features. Besides, SignClip introduces a hierarchical contrastive learning framework with multi-level alignment objectives, ensuring semantic consistency across sign-lip and visual-text modalities. Extensive experiments on two benchmark datasets, PHOENIX14T and How2Sign, demonstrate the superiority of our approach. For example, on PHOENIX14T, in the Gloss-free setting, SignClip surpasses the previous state-of-the-art model SpaMo, improving BLEU-4 from 24.32 to 24.71, and ROUGE from 46.57 to 48.38.
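The abstract does not spell out the contrastive objective, so the sketch below uses a generic InfoNCE loss as a stand-in: matching gesture and lip features sit in the same row, and every other row in the batch acts as a negative. The temperature, the two-dimensional toy features, and the function name are assumptions, and SignClip's hierarchical multi-level objectives are not modeled.

```python
import math

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss where positives[i] matches anchors[i]; other rows are negatives."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cos(anchor, p) / temperature for p in positives]
        log_denominator = math.log(sum(math.exp(l) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

hand = [[1.0, 0.0], [0.0, 1.0]]          # toy gesture features
lips_aligned = [[0.9, 0.1], [0.1, 0.9]]  # lip features that match row by row
lips_swapped = lips_aligned[::-1]        # same features paired with the wrong rows
```

The loss is small when each gesture feature is closest to its own lip feature and large when the pairing is swapped, which is the pressure that pulls the two modalities into a shared space.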

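Paper and project links

PDF

Summary

Sign language translation (SLT) translates sign language videos into natural language and serves as a vital bridge for inclusive communication. Recent work relies on powerful visual backbones and large language models, but most approaches focus on manual signals (hand gestures) and overlook non-manual cues such as mouthing. Mouthing conveys essential linguistic information in sign languages and plays a crucial role in disambiguating visually similar signs. SignClip is a framework that improves translation accuracy by fusing manual and non-manual cues, specifically spatial gesture and lip movement features. It also introduces a hierarchical contrastive learning framework with multi-level alignment objectives to keep sign-lip and visual-text modalities semantically consistent. Experiments on PHOENIX14T and How2Sign demonstrate the approach's advantage. For example, on PHOENIX14T in the gloss-free setting, SignClip surpasses the previous state-of-the-art model SpaMo, improving BLEU-4 from 24.32 to 24.71 and ROUGE from 46.57 to 48.38.

Key Takeaways

Sign language translation converts sign language videos into natural language and is important for inclusive communication.
Most current methods focus on manual signals (hand gestures) and overlook non-manual cues such as mouthing.
Mouthing conveys essential linguistic information and helps disambiguate visually similar signs.
SignClip fuses manual and non-manual cues, specifically spatial gesture and lip movement features, to improve translation accuracy.
SignClip uses hierarchical contrastive learning with multi-level alignment objectives to keep modalities semantically consistent.
Experiments on PHOENIX14T and How2Sign show that SignClip outperforms previous methods.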

SI-FACT: Mitigating Knowledge Conflict via Self-Improving Faithfulness-Aware Contrastive Tuning

Authors:Shengqiang Fu

Large Language Models often generate unfaithful responses in knowledge-intensive tasks due to knowledge conflict, that is, a preference for relying on internal parametric knowledge rather than the provided context. To address this issue, we propose a novel self-improving framework, Self-Improving Faithfulness-Aware Contrastive Tuning. The framework uses a self-instruct mechanism that allows the base LLM to automatically generate high-quality, structured contrastive learning data, including anchor samples, semantically equivalent positive samples, and negative samples simulating unfaithful scenarios. This approach significantly reduces the cost of manual annotation. Subsequently, contrastive learning is applied to train the model, enabling it to pull faithful responses closer and push unfaithful responses farther apart in the representation space. Experiments on knowledge conflict evaluation benchmarks ECARE KRE and COSE KRE show that the SI-FACT model based on Llama3 8B Instruct improves the Contextual Recall Rate by 6.2% over the best baseline method, while significantly reducing dependence on internal memory. The results indicate that SI-FACT provides strong effectiveness and high data efficiency in enhancing the contextual faithfulness of LLMs, offering a practical pathway toward building more proactive and trustworthy language models.
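The anchor, positive, and negative roles described above map naturally onto a triplet-style objective. The sketch below shows one such data record and a cosine-similarity triplet margin loss. The paper does not state that it uses this particular loss, so the loss form, the margin, the example sentences, and the toy embeddings are all assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class ContrastiveExample:
    anchor: str    # response faithful to the provided context
    positive: str  # semantically equivalent paraphrase of the anchor
    negative: str  # response that ignores the context

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def triplet_loss(anchor_vec, positive_vec, negative_vec, margin=0.2):
    """Zero once the positive is more similar to the anchor than the negative by `margin`."""
    gap = cosine(anchor_vec, positive_vec) - cosine(anchor_vec, negative_vec)
    return max(0.0, margin - gap)

example = ContrastiveExample(
    anchor="According to the passage, the capital is X.",
    positive="The passage states that X is the capital.",
    negative="The capital is Y.",  # stands in for an answer drawn from parametric memory
)
separated = triplet_loss([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])
confused = triplet_loss([1.0, 0.0], [0.0, 1.0], [0.9, 0.1])
```

Training on such records pushes the loss toward zero, which corresponds to faithful responses clustering together and unfaithful ones moving away in representation space.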

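Paper and project links

PDF

Summary

The paper proposes Self-Improving Faithfulness-Aware Contrastive Tuning (SI-FACT), a framework that addresses unfaithful responses caused by knowledge conflict in knowledge-intensive tasks. A self-instruct mechanism lets the base language model automatically generate high-quality, structured contrastive learning data, including anchor samples, semantically equivalent positive samples, and negative samples that simulate unfaithful scenarios, which significantly reduces manual annotation cost. Contrastive learning then trains the model to pull faithful responses closer together and push unfaithful responses farther away in representation space. On the knowledge conflict benchmarks ECARE KRE and COSE KRE, the SI-FACT model based on Llama3 8B Instruct improves the Contextual Recall Rate by 6.2% over the best baseline while significantly reducing dependence on internal memory. The results indicate that SI-FACT is effective and data-efficient at improving the contextual faithfulness of LLMs and offers a practical path toward more proactive and trustworthy language models.

Key Takeaways

Large language models produce unfaithful responses in knowledge-intensive tasks because of knowledge conflict.
The Self-Improving Faithfulness-Aware Contrastive Tuning (SI-FACT) framework is proposed to address this.
The framework uses a self-instruct mechanism to generate high-quality contrastive learning data automatically.
Contrastive training pulls faithful responses closer together and pushes unfaithful responses farther away.
On knowledge conflict benchmarks, SI-FACT improves the Contextual Recall Rate by 6.2% over the best baseline.
SI-FACT reduces dependence on internal memory and is data-efficient.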

Beyond Token Limits: Assessing Language Model Performance on Long Text Classification

Authors:Miklós Sebők, Viktor Kovács, Martin Bánóczy, Daniel Møller Eriksen, Nathalie Neptune, Philippe Roussille

The most widely used large language models in the social sciences (such as BERT, and its derivatives, e.g. RoBERTa) have a limitation on the input text length that they can process to produce predictions. This is a particularly pressing issue for some classification tasks, where the aim is to handle long input texts. One such area deals with laws and draft laws (bills), which can have a length of multiple hundred pages and, therefore, are not particularly amenable for processing with models that can only handle e.g. 512 tokens. In this paper, we show results from experiments covering 5 languages with XLM-RoBERTa, Longformer, GPT-3.5, GPT-4 models for the multiclass classification task of the Comparative Agendas Project, which has a codebook of 21 policy topic labels from education to health care. Results show no particular advantage for the Longformer model, pre-trained specifically for the purposes of handling long inputs. The comparison between the GPT variants and the best-performing open model yielded an edge for the latter. An analysis of class-level factors points to the importance of support and substance overlaps between specific categories when it comes to performance on long text inputs.
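A common workaround for a 512-token input limit is to split a long document into overlapping windows, classify each window, and aggregate the results. The abstract does not say whether the authors did this, so the sketch below is a generic illustration only. The window size, stride, majority-vote rule, and labels are assumptions.

```python
from collections import Counter

def chunk_tokens(tokens, max_len=512, stride=256):
    """Split a token list into overlapping windows of at most `max_len` tokens."""
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start : start + max_len])
        if start + max_len >= len(tokens):
            break  # the current window already reaches the end of the text
    return chunks

def majority_label(chunk_labels):
    """Aggregate per-chunk predictions by majority vote (one possible rule)."""
    return Counter(chunk_labels).most_common(1)[0][0]

tokens = [f"tok{i}" for i in range(1200)]  # stand-in for a tokenized bill
chunks = chunk_tokens(tokens)
label = majority_label(["health", "education", "health"])
```

Models with a long context window avoid this step, which is one reason the comparison with Longformer and the GPT models is of interest.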

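Paper and project links

PDF

Summary

The paper examines the input length limits of large language models widely used in the social sciences, focusing on classification of long texts such as laws and bills that can run to several hundred pages. Experiments cover five languages and compare XLM-RoBERTa, Longformer, GPT-3.5, and GPT-4 on the Comparative Agendas Project multiclass task, which has 21 policy topic labels. Longformer, which was pre-trained specifically for long inputs, shows no particular advantage, and the best-performing open model has an edge over the GPT variants. A class-level analysis points to the importance of support and substance overlaps between specific categories for performance on long inputs.

Key Takeaways

Widely used language models are limited in the input length they can process, which is a pressing problem for classifying laws and bills that run to hundreds of pages.
The experiments compare XLM-RoBERTa, Longformer, GPT-3.5, and GPT-4 across five languages.
The best-performing open model has an edge over the GPT variants on the multiclass classification task.
Longformer, although pre-trained for long inputs, shows no particular advantage.
Support and substance overlaps between categories matter for performance on long text inputs.
The codebook has 21 policy topic labels ranging from education to health care.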

The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization

Authors:Talha Tahir

Acceptance and Commitment Therapy (ACT) is a third-wave cognitive behavioral therapy with emerging evidence of efficacy in several psychiatric conditions. This study investigates the impact of post-training methodology and explicit reasoning on the ability of a small open-weight large language model (LLM) to deliver ACT. Using 50 sets of synthetic ACT transcripts generated by Mistral-Large, we trained Llama-3.2-3b-Instruct with two distinct approaches, supervised fine-tuning (SFT) and odds ratio policy optimization (ORPO), each with and without an explicit chain-of-thought (COT) reasoning step. Performance was evaluated by comparing these four post-trained variants against the base Instruct model. These models were benchmarked in simulated therapy sessions, with performance quantitatively assessed on the ACT Fidelity Measure (ACT-FM) and the Therapist Empathy Scale (TES) by an LLM judge that had been fine-tuned on human evaluations. Our findings demonstrate that the ORPO-trained models significantly outperformed both their SFT and Instruct counterparts on ACT fidelity ($\chi^2(5) = 185.15, p < .001$) and therapeutic empathy ($\chi^2(5) = 140.37, p < .001$). The effect of COT was conditional as it provided a significant benefit to SFT models, improving ACT-FM scores by an average of 2.68 points ($p < .001$), while offering no discernible advantage to the superior ORPO or instruct-tuned variants. We posit that the superiority of ORPO stems from its ability to learn the therapeutic 'process' over imitating 'content,' a key aspect of ACT, while COT acts as a necessary scaffold for models trained only via imitation. This study establishes that preference-aligned policy optimization can effectively instill ACT competencies in small LLMs, and that the utility of explicit reasoning is highly dependent on the underlying training paradigm.
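To show what an odds-ratio preference term looks like, the sketch below computes the loss for a preferred and a rejected response from their probabilities. It follows the published ORPO formulation as I understand it, where each probability is a length-normalized sequence likelihood and this term is added to the usual SFT loss with a weighting factor. Neither of those details is implemented here, and the probabilities are made up.

```python
import math

def log_odds(p):
    """log(p / (1 - p)) for a probability strictly between 0 and 1."""
    return math.log(p) - math.log1p(-p)

def odds_ratio_loss(p_chosen, p_rejected):
    """-log(sigmoid(log_odds(chosen) - log_odds(rejected)))."""
    gap = log_odds(p_chosen) - log_odds(p_rejected)
    return math.log1p(math.exp(-gap))

prefers_chosen = odds_ratio_loss(p_chosen=0.6, p_rejected=0.2)
prefers_rejected = odds_ratio_loss(p_chosen=0.2, p_rejected=0.6)
tie = odds_ratio_loss(p_chosen=0.5, p_rejected=0.5)
```

The loss falls as the model assigns higher odds to the preferred response than to the rejected one, so training rewards the contrast between responses rather than imitation of a single target.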

Paper and project links

PDF

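Summary

Acceptance and Commitment Therapy (ACT) is a third-wave cognitive behavioral therapy with emerging evidence of efficacy in several psychiatric conditions. This study examines how post-training method and explicit reasoning affect the ability of a small open-weight large language model (LLM) to deliver ACT. Using 50 sets of synthetic ACT transcripts generated by Mistral-Large, the author trained Llama-3.2-3b-Instruct with two approaches, supervised fine-tuning (SFT) and odds ratio policy optimization (ORPO), each with and without an explicit chain-of-thought (COT) reasoning step. The models were evaluated in simulated therapy sessions and scored on the ACT Fidelity Measure (ACT-FM) and the Therapist Empathy Scale (TES). ORPO-trained models significantly outperformed both the SFT and base Instruct models on ACT fidelity and therapeutic empathy. COT helped the SFT models but gave no discernible advantage to the ORPO models. The author attributes ORPO's advantage to learning the therapeutic process rather than imitating content, which is central to ACT, while COT serves as a necessary scaffold for models trained only by imitation. The study shows that preference-aligned policy optimization can instill ACT competencies in small LLMs, and that the value of explicit reasoning depends on the underlying training paradigm.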

Key Takeaways

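ACT is a third-wave cognitive behavioral therapy with emerging evidence of efficacy in several psychiatric conditions.
The study tests how training method and explicit reasoning affect a small LLM's ability to deliver ACT.
ORPO training significantly improves ACT delivery, outperforming both SFT and the base Instruct model.
An explicit COT reasoning step helps SFT models but offers no discernible advantage to the ORPO or Instruct variants.
ORPO's advantage is attributed to learning the therapeutic process rather than imitating content, which matches a core aspect of ACT.
Preference-aligned policy optimization is an effective way to instill ACT competencies in small LLMs.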
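
The abstract names ORPO but does not spell out its objective. The sketch below assumes the commonly published form of the ORPO loss: a supervised term on the preferred response plus a weighted odds-ratio term that pushes the preferred response's odds above the rejected one's. The function names, the use of average per-token log-probabilities as inputs, and the default weight `lam` are illustrative assumptions, not details taken from the study.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp); avg_logp must be negative."""
    if avg_logp >= 0.0:
        raise ValueError("average log-probability must be negative")
    return avg_logp - math.log1p(-math.exp(avg_logp))


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x))."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def orpo_loss(logp_chosen: float, logp_rejected: float, lam: float = 0.1) -> float:
    """Assumed ORPO form: SFT loss on the chosen response + lam * odds-ratio penalty."""
    sft_term = -logp_chosen  # negative log-likelihood of the preferred response
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    return sft_term + lam * neg_log_sigmoid(ratio)
```

Because the penalty shrinks only when the preferred response becomes relatively more likely than the rejected one, the model is trained on the contrast between the two responses and not only on reproducing one of them.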


Parallel-R1: Towards Parallel Thinking via Reinforcement Learning

Authors:Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu

Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. In contrast, we propose Parallel-R1, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model’s thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a mid-training exploration scaffold, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-source at https://github.com/zhengkid/Parallel-R1.

Summary

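The paper addresses parallel thinking as an emerging way to strengthen the reasoning of large language models (LLMs). It proposes Parallel-R1, a reinforcement learning (RL) framework that enables parallel thinking on complex real-world reasoning tasks. The framework uses a progressive curriculum to address the cold-start problem in training parallel thinking with RL: supervised fine-tuning (SFT) on easier tasks first instills the ability, then RL explores and generalizes it on harder problems. Experiments show that Parallel-R1 successfully instills parallel thinking, improving accuracy by 8.4% over a sequential thinking model trained directly with RL on challenging tasks. Further analysis shows that the model uses parallel thinking differently across training: early on as an exploration strategy, later for multi-perspective verification. The paper also validates parallel thinking as a mid-training exploration scaffold, a temporary exploratory phase that unlocks a higher performance ceiling after RL.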

Key Takeaways

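Parallel thinking helps strengthen the reasoning ability of LLMs.
Parallel-R1 is an RL framework that enables parallel thinking on complex real-world reasoning tasks.
A progressive curriculum addresses the cold-start problem in training parallel thinking.
SFT on easier tasks instills parallel thinking, after which RL explores and generalizes the skill on harder problems.
Parallel-R1 successfully instills parallel thinking and improves accuracy by 8.4% over a sequential thinking model trained directly with RL.
The model uses parallel thinking differently across training: for exploration early on, and for multi-perspective verification later.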


MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining

Authors:Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke

Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.

Paper and project links

PDF

Summary

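Large language models (LLMs) have broad knowledge and strong general-purpose reasoning, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks. MachineLearningLM is a portable continued-pretraining framework that gives a general-purpose LLM robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. The pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), with shot counts up to 1,024. A random-forest teacher distills tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, which fits 3x to 6x more examples per context window and delivers up to 50x amortized throughput via batch inference. MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by about 15% on average on out-of-distribution tabular classification across finance, physics, biology, and healthcare. It shows a striking many-shot scaling behavior: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it reaches random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved, with 75.4% on MMLU.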

Key Takeaways

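LLMs have broad knowledge and strong general-purpose reasoning but struggle to learn from many in-context examples on standard ML tasks.
MachineLearningLM is a portable continued-pretraining framework that adds in-context ML capability while preserving general knowledge and reasoning.
The pretraining procedure synthesizes ML tasks from SCMs, with shot counts up to 1,024.
A random-forest teacher strengthens the LLM's robustness in numerical modeling.
A token-efficient prompt serializes all tasks, fitting 3x to 6x more examples per context window and raising throughput via batch inference.
MachineLearningLM outperforms strong LLM baselines on out-of-distribution tabular classification across several domains.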
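
To make the idea of a token-efficient prompt with batched queries concrete, here is a minimal sketch. It assumes a compact layout in which feature names appear once and each row is a single comma-separated line; the exact format used by MachineLearningLM is not given in the abstract, so the layout, the `|` label separator, and the function names are hypothetical.

```python
def serialize_rows(rows, labels=None):
    """Render each row as one compact line: 'v1,v2,...' or 'v1,v2,...|label'."""
    lines = []
    for index, row in enumerate(rows):
        cells = ",".join(str(value) for value in row)
        if labels is None:
            lines.append(cells)
        else:
            lines.append(f"{cells}|{labels[index]}")
    return "\n".join(lines)


def build_prompt(feature_names, train_rows, train_labels, query_rows):
    """Build one many-shot prompt that asks for predictions on a batch of query rows."""
    header = ",".join(feature_names) + "|label"
    return "\n".join([
        "Columns: " + header,  # feature names are stated once, not per row
        "Examples:",
        serialize_rows(train_rows, train_labels),
        "Predict the label for each query row, one per line:",
        serialize_rows(query_rows),  # several queries share the same demonstrations
    ])


prompt = build_prompt(
    ["age", "income"],
    [[31, 52], [47, 18]],
    ["yes", "no"],
    [[39, 40], [22, 75]],
)
print(prompt)
```

Stating the column names once and packing several query rows behind the same demonstrations is one way to fit more examples per context window and amortize the cost of the shared examples across a batch.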


Cut Costs, Not Accuracy: LLM-Powered Data Processing with Guarantees

Authors:Sepanta Zeighami, Shreya Shankar, Aditya Parameswaran

Large Language Models (LLMs) are being increasingly used as a building block in data systems to process large text datasets. To do so, LLM model providers offer multiple LLMs with different sizes, spanning various cost-quality trade-offs when processing text at scale. Top-of-the-line LLMs (e.g., GPT-4o, Claude Sonnet) operate with high accuracy but are prohibitively expensive when processing many records. To avoid high costs, more affordable but lower quality LLMs (e.g., GPT-4o-mini, Claude Haiku) can be used to process records, but we need to ensure that the overall accuracy does not deviate substantially from that of the top-of-the-line LLMs. The model cascade framework provides a blueprint to manage this trade-off, by using the confidence of LLMs in their output (e.g., log-probabilities) to decide on which records to use the affordable LLM. However, existing solutions following this framework provide only marginal cost savings and weak theoretical guarantees because of poor estimation of the quality of the affordable LLM’s outputs. We present BARGAIN, a method that judiciously uses affordable LLMs in data processing to significantly reduce cost while providing strong theoretical guarantees on the solution quality. BARGAIN employs a novel adaptive sampling strategy and statistical estimation procedure that uses data and task characteristics and builds on recent statistical tools to make accurate estimations with tight theoretical guarantees. Variants of BARGAIN can support guarantees on accuracy, precision, or recall of the output. Experimental results across 8 real-world datasets show that BARGAIN reduces cost, on average, by up to 86% more than state-of-the-art, while providing stronger theoretical guarantees on accuracy of output, with similar gains when guaranteeing a desired level of precision or recall.

Paper and project links

PDF To appear in SIGMOD’26

Summary

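Large language models (LLMs) are increasingly used as building blocks in data systems that process large text datasets. LLMs of different sizes offer different cost-quality trade-offs. Top-of-the-line LLMs (e.g., GPT-4o, Claude Sonnet) are accurate but prohibitively expensive over many records. More affordable but lower-quality LLMs (e.g., GPT-4o-mini, Claude Haiku) can process records instead, provided overall accuracy does not deviate substantially from that of the top-of-the-line models. The model cascade framework offers a blueprint for this trade-off: it uses the LLM's confidence in its output (e.g., log-probabilities) to decide which records the affordable LLM handles. Existing solutions in this framework yield only marginal cost savings and weak theoretical guarantees because they estimate the quality of the affordable LLM's outputs poorly. BARGAIN uses affordable LLMs judiciously to reduce cost significantly while providing strong theoretical guarantees on solution quality. It employs a novel adaptive sampling strategy and a statistical estimation procedure that exploit data and task characteristics and build on recent statistical tools to produce accurate estimates with tight guarantees. Across 8 real-world datasets, BARGAIN reduces cost, on average, by up to 86% more than the state of the art while providing stronger theoretical guarantees on output accuracy, with similar gains when guaranteeing a desired level of precision or recall.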

Key Takeaways

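LLMs serve as key building blocks in data systems that process large text datasets.
Top-of-the-line LLMs are accurate but expensive; more affordable LLMs can serve as a substitute.
The model cascade framework uses LLM confidence to decide which records the affordable LLM processes.
Existing solutions estimate the quality of the affordable LLM's outputs poorly, which leads to marginal cost savings and weak theoretical guarantees.
BARGAIN uses affordable LLMs judiciously, reducing cost while providing strong theoretical guarantees.
BARGAIN relies on adaptive sampling and statistical estimation that exploit data and task characteristics.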
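
The routing step of a model cascade can be sketched in a few lines. This is only the generic cascade described above, not BARGAIN itself: the adaptive sampling and statistical estimation that choose the threshold with guarantees are not reproduced. `cheap_model`, `expensive_model`, and `threshold` are hypothetical stand-ins supplied by the caller.

```python
from typing import Callable, List, Tuple


def cascade(
    records: List[str],
    cheap_model: Callable[[str], Tuple[str, float]],
    expensive_model: Callable[[str], str],
    threshold: float,
) -> Tuple[List[str], int]:
    """Keep the cheap model's answer when its confidence reaches the threshold;
    otherwise fall back to the expensive model. Returns the outputs and the
    number of expensive calls that were needed."""
    outputs = []
    expensive_calls = 0
    for record in records:
        answer, confidence = cheap_model(record)
        if confidence < threshold:
            answer = expensive_model(record)
            expensive_calls += 1
        outputs.append(answer)
    return outputs, expensive_calls
```

The whole cost-quality trade-off sits in `threshold`: a higher value sends more records to the expensive model. Choosing that value so that a target accuracy, precision, or recall still holds is the part the paper addresses.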


LMAR: Language Model Augmented Retriever for Domain-specific Knowledge Indexing

Authors:Yao Zhao, Yantian Ding, Zhiyue Zhang, Dapeng Yao, Yanxun Xu

Retrieval Augmented Generation (RAG) systems often struggle with domain-specific knowledge due to performance deterioration of pre-trained embeddings and prohibitive computational costs of large language model (LLM)-based retrievers. While fine-tuning data augmentation embedding models offers a promising direction, its effectiveness is limited by the need for high-quality training data and reliable chunking strategies that preserve contextual integrity. We propose LMAR (Language Model Augmented Retriever), a model-agnostic framework that addresses these challenges by combining LLM-guided data synthesis with contrastive embedding adaptation and efficient text clustering. LMAR consists of a two-stage pipeline: (1) Triplet sampling and synthetic data augmentation, where LLMs act as both labeler and validator to ensure high-fidelity supervision throughout the pipeline. Experimental results across multiple domain-specific benchmark datasets demonstrate that LMAR outperforms multiple baseline models, while maintaining moderate hardware requirements and low latency. Its model-agnostic nature further enables seamless integration with emerging RAG architectures and text embedding models, ensuring continual improvements without redesigning the pipeline. These results highlight LMAR as a practical and cost-effective solution for scalable domain-specific adaptation.

Paper and project links

PDF

Summary

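LMAR is a framework that combines LLM-guided data synthesis, contrastive embedding adaptation, and efficient text clustering to address the difficulties retrieval augmented generation (RAG) systems face with domain-specific knowledge. It uses a two-stage pipeline that includes triplet sampling and synthetic data augmentation, with LLMs acting as both labeler and validator to ensure high-fidelity supervision throughout the pipeline. Experiments on multiple domain-specific benchmark datasets show that LMAR outperforms multiple baseline models while maintaining moderate hardware requirements and low latency. Because it is model-agnostic, it integrates seamlessly with emerging RAG architectures and text embedding models, allowing continual improvement without redesigning the pipeline. These results highlight LMAR as a practical and cost-effective option for scalable domain-specific adaptation.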

Key Takeaways

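LMAR addresses the difficulties RAG systems face with domain-specific knowledge.
It improves performance by combining LLM-guided data synthesis, contrastive embedding adaptation, and efficient text clustering.
Its two-stage pipeline includes triplet sampling and synthetic data augmentation to ensure high-fidelity supervision.
LMAR outperforms baseline models on multiple domain-specific benchmark datasets.
Moderate hardware requirements and low latency make it practical.
Its model-agnostic design allows seamless integration with emerging RAG architectures and text embedding models.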
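
Contrastive adaptation on sampled triplets is often trained with a margin-based triplet loss. The abstract does not state which loss LMAR uses, so the following is a sketch under that assumption, using cosine distance and a hypothetical margin of 0.2.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)
```

A triplet here is a query embedding (anchor), an embedding of a passage labeled relevant (positive), and one labeled irrelevant (negative). Minimizing the loss moves relevant passages closer to the query in embedding space.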


Dynamic Motion Blending for Versatile Motion Editing

Authors:Nan Jiang, Hongjie Li, Ziye Yuan, Zimo He, Yixin Chen, Tengyu Liu, Yixin Zhu, Siyuan Huang

Text-guided motion editing enables high-level semantic control and iterative modifications beyond traditional keyframe animation. Existing methods rely on limited pre-collected training triplets, which severely hinders their versatility in diverse editing scenarios. We introduce MotionCutMix, an online data augmentation technique that dynamically generates training triplets by blending body part motions based on input text. While MotionCutMix effectively expands the training distribution, the compositional nature introduces increased randomness and potential body part incoordination. To model such a rich distribution, we present MotionReFit, an auto-regressive diffusion model with a motion coordinator. The auto-regressive architecture facilitates learning by decomposing long sequences, while the motion coordinator mitigates the artifacts of motion composition. Our method handles both spatial and temporal motion edits directly from high-level human instructions, without relying on additional specifications or Large Language Models. Through extensive experiments, we show that MotionReFit achieves state-of-the-art performance in text-guided motion editing.

Paper and project links

PDF

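Summary

Text-guided motion editing enables high-level semantic control and iterative modifications beyond traditional keyframe animation. Existing methods rely on limited pre-collected training triplets, which severely restricts their versatility across diverse editing scenarios. MotionCutMix is an online data augmentation technique that dynamically generates training triplets by blending body part motions based on input text. It effectively expands the training distribution, but its compositional nature adds randomness and potential body part incoordination. To model this rich distribution, the authors present MotionReFit, an auto-regressive diffusion model with a motion coordinator. The auto-regressive architecture eases learning by decomposing long sequences, while the motion coordinator mitigates the artifacts of motion composition. The method handles both spatial and temporal motion edits directly from high-level human instructions, without additional specifications or large language models. Extensive experiments show that MotionReFit achieves state-of-the-art performance in text-guided motion editing.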

Key Takeaways

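Text-guided motion editing goes beyond traditional keyframe animation by offering high-level semantic control and iterative modification.
Existing methods rely on limited pre-collected training triplets, which restricts their versatility across diverse editing scenarios.
MotionCutMix is a data augmentation technique that dynamically generates training triplets by blending body part motions.
MotionCutMix effectively expands the training distribution but adds randomness and potential body part incoordination.
MotionReFit addresses this by combining an auto-regressive diffusion model with a motion coordinator.
The auto-regressive architecture eases learning by decomposing long sequences, while the motion coordinator mitigates the artifacts of motion composition.
The method handles spatial and temporal motion edits directly from high-level human instructions, without additional specifications or large language models, and achieves state-of-the-art performance in text-guided motion editing.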
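
The body-part blending behind MotionCutMix can be illustrated with a toy example. The sketch assumes a motion is a list of frames, each frame a list with one value per joint, and that a body part is a set of joint indices copied wholesale from a donor motion. The paper's actual motion representation, masks, and blending weights are not described in the abstract, so every name below is hypothetical.

```python
def blend_motions(base, donor, part_joints):
    """Copy the joints in `part_joints` from `donor` into `base`, frame by frame."""
    if len(base) != len(donor):
        raise ValueError("motions must have the same number of frames")
    part = set(part_joints)
    blended = []
    for base_frame, donor_frame in zip(base, donor):
        blended.append([
            donor_frame[j] if j in part else base_frame[j]
            for j in range(len(base_frame))
        ])
    return blended


def make_triplet(base, donor, part_joints, instruction):
    """Build one (source motion, edit instruction, target motion) training triplet."""
    return base, instruction, blend_motions(base, donor, part_joints)
```

Because the target is composed on the fly from two existing motions, new triplets can be generated during training without pre-collecting them. The hard per-joint copy used here is also the kind of composition that can leave body parts uncoordinated, which is the artifact the motion coordinator is meant to reduce.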


Atomic Fact Decomposition Helps Attributed Question Answering

Authors:Zhichao Yan, Jiapu Wang, Jiaoyan Chen, Xiaoli Li, Ru Li, Jeff Z. Pan

Attributed Question Answering (AQA) aims to provide both a trustworthy answer and a reliable attribution report for a given question. Retrieval is a widely adopted approach, including two general paradigms: Retrieval-Then-Read (RTR) and post-hoc retrieval. Recently, Large Language Models (LLMs) have shown remarkable proficiency, prompting growing interest in AQA among researchers. However, RTR-based AQA often suffers from irrelevant knowledge and rapidly changing information, even when LLMs are adopted, while post-hoc retrieval-based AQA struggles to comprehend long-form answers with complex logic, to identify precisely the content needing revision, and to preserve the original intent. To tackle these problems, this paper proposes an Atomic fact decomposition-based Retrieval and Editing (ARE) framework, which uses instruction-tuned LLMs to decompose the generated long-form answers into molecular clauses and atomic facts. Notably, the instruction-tuned LLMs are fine-tuned using a well-constructed dataset generated from large-scale Knowledge Graphs (KGs). This process involves extracting one-hop neighbors from a given set of entities and transforming the result into coherent long-form text. Subsequently, ARE leverages a search engine to retrieve evidence related to atomic facts and feeds this evidence into an LLM-based verifier to determine whether the facts require expansion for re-retrieval or editing. Furthermore, the edited facts are backtracked into the original answer, with evidence aggregated based on the relationship between molecular clauses and atomic facts. Extensive evaluations demonstrate the superior performance of our proposed method over the state of the art on several datasets, with an additionally proposed new metric $Attr_{p}$ for evaluating the precision of evidence attribution.

Paper and project links

PDF

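Summary

Attributed Question Answering (AQA) aims to provide both a trustworthy answer and a reliable attribution report for a given question. Large language models (LLMs) have recently shown remarkable proficiency, prompting growing research interest in AQA. However, Retrieval-Then-Read (RTR) based AQA often suffers from irrelevant knowledge and rapidly changing information even when LLMs are adopted, while post-hoc retrieval-based AQA struggles to handle long-form answers with complex logic, to identify precisely which content needs revision, and to preserve the original intent. To address these problems, the paper proposes the Atomic fact decomposition-based Retrieval and Editing (ARE) framework. Instruction-tuned LLMs decompose the generated long-form answers into molecular clauses and atomic facts; these LLMs are fine-tuned on a well-constructed dataset generated from large-scale knowledge graphs. ARE then uses a search engine to retrieve evidence related to the atomic facts and an LLM-based verifier to decide whether each fact requires expansion for re-retrieval or editing. Evaluations show that the method outperforms the state of the art on several datasets, and the paper proposes a new metric, $Attr_{p}$, for evaluating the precision of evidence attribution.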

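Key Takeaways

AQA aims to provide a trustworthy answer together with a reliable attribution report.
The proficiency of LLMs has drawn growing research interest in AQA.
RTR-based AQA suffers from irrelevant knowledge and rapidly changing information.
Post-hoc retrieval-based AQA struggles with long-form answers that contain complex logic.
ARE uses instruction-tuned LLMs to decompose long-form answers into molecular clauses and atomic facts, then retrieves evidence with a search engine and checks it with an LLM-based verifier.
The method outperforms the state of the art on several datasets.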
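
The decompose, retrieve, verify, and edit loop described above can be sketched as plain control flow. All five callables are hypothetical stand-ins for the LLM and search-engine components, the verdict strings are invented for illustration, and backtracking is reduced to a string replacement. The molecular-clause level and the evidence aggregation of the real framework are omitted.

```python
def are_revise(answer, decompose, retrieve, verify, expand, edit, max_rounds=2):
    """Check each atomic fact of `answer` against retrieved evidence and
    write edited facts back into the answer.

    verify(fact, evidence) is assumed to return "supported", "expand", or "edit".
    """
    revised = answer
    attribution = {}
    for fact in decompose(answer):
        query = fact
        evidence = retrieve(query)
        verdict = verify(fact, evidence)
        rounds = 0
        # Re-retrieve with an expanded query while the verifier asks for it.
        while verdict == "expand" and rounds < max_rounds:
            query = expand(query)
            evidence = retrieve(query)
            verdict = verify(fact, evidence)
            rounds += 1
        final_fact = fact
        if verdict == "edit":
            final_fact = edit(fact, evidence)
            # Simplified backtracking: assumes the fact appears verbatim in the answer.
            revised = revised.replace(fact, final_fact)
        attribution[final_fact] = evidence
    return revised, attribution
```

Working at the level of atomic facts keeps each retrieval query and each edit small, which is how the framework tries to revise only the content that needs it while leaving the rest of the answer untouched.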


Direct Judgement Preference Optimization

Authors:Peifeng Wang, Austin Xu, Yilun Zhou, Caiming Xiong, Shafiq Joty

Auto-evaluation is crucial for assessing response quality and offering feedback for model development. Recent studies have explored training large language models (LLMs) as generative judges to evaluate and critique other models’ outputs. In this work, we investigate the idea of learning from both positive and negative data with preference optimization to enhance the evaluation capabilities of LLM judges across an array of different use cases. We achieve this by employing three approaches to collect the preference pairs for different use cases, each aimed at improving our generative judge from a different perspective. Our comprehensive study over a wide range of benchmarks demonstrates the effectiveness of our method. In particular, our generative judge achieves the best performance on 10 out of 13 benchmarks, outperforming strong baselines like GPT-4o and specialized judge models. Further analysis shows that our judge model robustly counters inherent biases such as position and length bias, flexibly adapts to any evaluation protocol specified by practitioners, and provides helpful language feedback for improving downstream generator models.

Paper and project links

PDF EMNLP 2025

Summary

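Motivated by the importance of auto-evaluation, this work trains large language models (LLMs) as generative judges that evaluate and critique other models' outputs. It explores learning from both positive and negative data with preference optimization to strengthen the evaluation capabilities of LLM judges. Three approaches collect preference pairs for different use cases, each improving the generative judge from a different perspective. A study over a wide range of benchmarks demonstrates the method's effectiveness: the generative judge achieves the best performance on 10 of 13 benchmarks, outperforming strong baselines such as GPT-4o and specialized judge models. Further analysis shows that the judge robustly counters inherent biases such as position and length bias, adapts flexibly to any evaluation protocol specified by practitioners, and provides helpful language feedback for improving downstream generator models.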

Key Takeaways

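Auto-evaluation is crucial for assessing response quality and giving feedback for model development.
Training LLMs as judges that evaluate other models' outputs is a recent research direction.
Learning from both positive and negative data with preference optimization strengthens the evaluation capabilities of LLM judges.
Three approaches collect preference pairs, each improving the generative judge from a different perspective.
The generative judge achieves the best performance on 10 of 13 benchmarks, outperforming strong baselines.
The judge robustly counters inherent biases such as position and length bias.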
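
One plausible way to obtain positive and negative data for a judge is to sample several judgements for the same evaluation prompt and split them by agreement with a known correct verdict. The abstract does not describe the paper's three collection approaches, so the sketch below is an assumption for illustration only; the record layout with `prompt`, `chosen`, and `rejected` keys is likewise hypothetical.

```python
from itertools import product


def build_preference_pairs(prompt, judgements, gold_verdict):
    """Pair every judgement that reaches the gold verdict (chosen) with every
    judgement that does not (rejected).

    `judgements` is a list of (judgement_text, verdict) tuples sampled from a judge.
    """
    chosen = [text for text, verdict in judgements if verdict == gold_verdict]
    rejected = [text for text, verdict in judgements if verdict != gold_verdict]
    return [
        {"prompt": prompt, "chosen": good, "rejected": bad}
        for good, bad in product(chosen, rejected)
    ]
```

Pairs of this shape are the usual input to preference-optimization objectives, which raise the likelihood of the chosen judgement relative to the rejected one.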


UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs

Authors:Wenhao Li, Mingbao Lin, Yunshan Zhong, Shuicheng Yan, Rongrong Ji

Managing long texts is challenging for large language models (LLMs) due to limited context window sizes. This study introduces UIO-LLMs, an unbiased incremental optimization approach for memory-enhanced transformers under long-context settings. We initially conceptualize the process as a streamlined encoder-decoder framework where the weights-shared encoder and decoder respectively encapsulate a context segment into memories and leverage these memories to predict outputs of the subsequent segment. Subsequently, by treating our memory-enhanced transformers as fully-connected recurrent neural networks (RNNs), we refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm, which incorporates innovative incremental optimization techniques. These techniques not only diminish time complexity but also address the bias in gradient computation through an unbiased optimization process. UIO-LLMs successfully handle long context, such as extending the context window of Llama2-7b-chat from 4K to 100K tokens with minimal 2% additional parameters, while keeping the inference cost nearly linear as context length increases.

Paper and project links

PDF This article was not accepted, and its quality is not very good. Therefore, we have decided to withdraw the submission and will not resubmit it elsewhere

Summary

This work addresses the difficulty large language models (LLMs) have with long texts due to limited context window sizes, and introduces UIO-LLMs, an unbiased incremental optimization approach for memory-enhanced transformers. The process is framed as a streamlined encoder-decoder framework in which a weight-shared encoder and decoder encapsulate each context segment into memories and use those memories to predict the outputs of the subsequent segment. Treating the memory-enhanced transformer as a fully-connected recurrent neural network (RNN), the authors refine training with the Truncated Backpropagation Through Time (TBPTT) algorithm, augmented with incremental optimization techniques that reduce time complexity and remove the bias in gradient computation. UIO-LLMs extend the context window of Llama2-7b-chat from 4K to 100K tokens with only 2% additional parameters, while inference cost stays nearly linear as context length grows.

Key Takeaways

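LLMs struggle with long texts because of limited context window sizes.
UIO-LLMs is an unbiased incremental optimization approach for memory-enhanced transformers under long-context settings.
A weight-shared encoder and decoder encapsulate each context segment into memories and use them to predict the outputs of the subsequent segment.
Training treats the model as a fully-connected RNN and applies TBPTT with incremental optimization techniques.
These techniques reduce time complexity and address the bias in gradient computation.
The context window of Llama2-7b-chat grows from 4K to 100K tokens with 2% additional parameters, at nearly linear inference cost.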
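
The segment-by-segment use of memories can be shown with a toy loop. The sketch assumes that each segment is compressed into a fixed number of memory slots and that those slots accumulate across segments. `keep_last` is a placeholder for the learned encoder, and neither the accumulation scheme nor the slot count comes from the abstract.

```python
def split_segments(tokens, segment_size):
    """Cut a long token sequence into consecutive fixed-size segments."""
    return [tokens[i:i + segment_size] for i in range(0, len(tokens), segment_size)]


def keep_last(segment, slots):
    """Placeholder 'encoder': a real model would learn this compression."""
    return segment[-slots:]


def run_with_memory(tokens, segment_size, slots, encode=keep_last):
    """Process segments in order; each one is predicted from the memories of
    earlier segments, then compressed into new memories itself."""
    memory = []
    trace = []  # (memory visible to the decoder, segment being predicted)
    for segment in split_segments(tokens, segment_size):
        trace.append((list(memory), segment))
        memory.extend(encode(segment, slots))
    return memory, trace
```

Each step sees one full segment plus a short memory instead of the entire history, so the amount of context handled per step grows far more slowly than the raw sequence length.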
