LLM

发布日期: 2025-04-23

更新日期: 2025-05-14

文章字数: 20.1k

阅读时长: 81 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-04-23 更新

Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs

Authors:Chun-Hsiao Yeh, Chenyu Wang, Shengbang Tong, Ta-Ying Cheng, Rouyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao, Yi Ma

Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge in Multi-Modal Large Language Models (MLLMs) to be used as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and cross-view correspondence. To comprehensively evaluate the challenges of MLLMs in multi-view scene reasoning, we propose All-Angles Bench, a benchmark of over 2,100 human carefully annotated multi-view question-answer pairs across 90 diverse real-world scenes. Our six tasks (counting, attribute identification, relative distance, relative direction, object manipulation, and camera pose estimation) specifically test model’s geometric correspondence and the capacity to align information consistently across views. Our extensive experiments, benchmark on 27 representative MLLMs including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o against human evaluators reveals a substantial performance gap, indicating that current MLLMs remain far from human-level proficiency. Through in-depth analysis, we show that MLLMs are particularly underperforming under two aspects: (1) cross-view correspondence for partially occluded views and (2) establishing the coarse camera poses. These findings highlight the necessity of domain-specific refinements or modules that embed stronger multi-view awareness. We believe that our All-Angles Bench offers valuable insights and contribute to bridging the gap between MLLMs and human-level multi-view understanding. The project and benchmark are publicly available at https://danielchyeh.github.io/All-Angles-Bench/.

多视角理解是在多模态大型语言模型（MLLMs）中作为实体代理应用时面临的一项基本挑战。这种能力能够在不同视角间协调视觉信息，以实现有效的导航、操作和3D场景理解。虽然最近的MLLMs在高层次推理和规划方面取得了令人印象深刻的进展，但在面对多视角几何一致性和跨视角对应问题时，它们经常显得能力不足。为了全面评估MLLMs在多视角场景推理方面的挑战，我们提出了全方位基准测试（All-Angles Bench），这是一个包含超过2100个人类精心标注的多视角问答对，涵盖90个多样化的真实世界场景。我们的六个任务（计数、属性识别、相对距离、相对方向、对象操作、相机姿态估计）专门测试模型的几何对应能力以及在各个视角间一致对齐信息的能力。我们对27个具有代表性的MLLMs进行了广泛的实验，包括Gemini-2.0-Flash、Claude-3.7-Sonnet和GPT-4与人类评估者的基准测试，揭示了显著的性能差距，这表明当前MLLMs距离人类水平还有很长的路要走。通过深入分析，我们发现在两个方面，MLLMs表现尤为不足：（1）对部分遮挡视图的跨视角对应；（2）建立粗略的相机姿态。这些发现强调了特定领域细化或嵌入更强多视角意识的模块必要性。我们相信，我们的全方位基准测试提供了宝贵的见解，并为弥合MLLMs和人类水平的多视角理解之间的差距做出贡献。该项目和基准测试可在https://danielchyeh.github.io/All-Angles-Bench/公开访问。

Summary
多模态大型语言模型（MLLMs）在处理多视角理解方面存在挑战，即融合不同视角的信息以实现有效的导航、操作和3D场景理解。提出All-Angles Bench基准测试，包含2,100多个经过人类精心标注的多视角问答对，以全面评估MLLMs在多视角场景推理方面的挑战。实验结果显示，当前MLLMs与人类水平存在显著性能差距，特别是在部分遮挡视角和粗略相机姿态估计方面。需要领域特定的优化或模块来增强多视角感知能力。

Key Takeaways

多模态大型语言模型（MLLMs）在多视角理解方面面临挑战，需要融合不同视角的信息以实现有效导航、操作和3D场景理解。
提出All-Angles Bench基准测试，用于全面评估MLLMs在多视角场景推理方面的性能。
MLLMs在跨视角对应和部分遮挡视图方面的表现较弱。
实验中包括多种任务，如计数、属性识别、相对距离、相对方向、对象操作、相机姿态估计等，专门测试模型的几何对应能力和跨视角信息一致性。
与人类评估者相比，当前MLLMs存在显著性能差距。
需要领域特定的优化或模块来增强MLLMs的多视角感知能力。

Cool Papers

点此查看论文截图

Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

Authors:Jie Cheng, Ruixi Qiao, Lijun Li, Chao Guo, Junle Wang, Gang Xiong, Yisheng Lv, Fei-Yue Wang

Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning (RL), which defines the value as cumulative gamma-decayed future rewards, easily induces LLMs to hack steps with high rewards. To address this, we propose PURE: Process sUpervised Reinforcement lEarning. The key innovation of PURE is a min-form credit assignment that formulates the value function as the minimum of future rewards. This method significantly alleviates reward hacking by limiting the value function range and distributing advantages more reasonably. Through extensive experiments on 3 base models, we show that PRM-based approaches enabling min-form credit assignment achieve comparable reasoning performance to verifiable reward-based methods within only 30% steps. In contrast, the canonical sum-form credit assignment collapses training even at the beginning! Additionally, when we supplement PRM-based fine-tuning with just 10% verifiable rewards, we further alleviate reward hacking and produce the best fine-tuned model based on Qwen2.5-Math-7B in our experiments, achieving 82.5% accuracy on AMC23 and 53.3% average accuracy across 5 benchmarks. Moreover, we summarize the observed reward hacking cases and analyze the causes of training collapse. Code and models are available at https://github.com/CJReinforce/PURE.

流程奖励模型（PRM）在挑战性推理任务中对大型语言模型（LLM）的测试时间扩展表现出了有效性。然而，PRM中的奖励破解问题限制了其在强化微调中的成功应用。在本文中，我们确定了PRM引起的奖励破解的主要原因：强化学习（RL）中的规范总和形式的信用分配，它将价值定义为累积的未来奖励的gamma衰减，容易诱导LLM破解高奖励的步骤。为了解决这个问题，我们提出了PURE：流程监督强化学习。PURE的关键创新之处在于最小形式的信用分配，它将价值函数制定为未来奖励的最小值。这种方法通过限制价值函数范围和更合理地分配优势来显著缓解奖励破解问题。通过在三套基础模型上的大量实验，我们证明了基于PRM的方法，使能最小形式的信用分配，在仅30%的步骤内实现了与可验证奖励方法相当的推理性能。相比之下，规范总和形式的信用分配甚至在训练开始时就会崩溃！此外，当我们仅在PRM微调中加入10%的可验证奖励时，我们进一步缓解了奖励破解问题，并在我们的实验中基于Qwen2.5-Math-7B生成了最佳微调模型，在AMC23上的准确率达到82.5%，在五个基准测试上的平均准确率为53.3%。此外，我们还总结了观察到的奖励破解案例，并分析了训练崩溃的原因。代码和模型可在https://github.com/CJReinforce/PURE中找到。

论文及项目相关链接

PDF

Summary

本文探讨了过程奖励模型（PRM）在大规模语言模型（LLM）测试时间扩展方面的有效性及其在强化微调中面临的奖励黑客攻击问题。针对PRM诱导的奖励黑客攻击的主要原因，即强化学习中的规范求和形式的信用分配，本文提出了PURE：过程监督强化学习。PURE的关键创新之处在于采用最小形式的信用分配，将价值函数定义为未来奖励的最小值，从而有效缓解奖励黑客攻击。实验表明，基于PRM的方法能够实现与可验证奖励方法相当的原理性能，并且在仅30%的步骤内实现。同时，当将基于PRM的微调与仅10%的可验证奖励相结合时，能够在实验中的Qwen2.5-Math-7B模型上获得最佳微调效果。

Key Takeaways

过程奖励模型（PRM）在测试时间扩展LLM方面表现出有效性。
PRM在强化微调中面临奖励黑客攻击问题。
PRM诱导的奖励黑客攻击的主要原因是强化学习中的规范求和形式的信用分配。
提出了PURE：过程监督强化学习，采用最小形式的信用分配来缓解奖励黑客攻击。
基于PRM的方法在仅30%的步骤内实现了与可验证奖励方法相当的原理性能。
结合10%的可验证奖励与基于PRM的微调，在特定模型上获得了最佳微调效果。

Cool Papers

点此查看论文截图

CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation

Authors:Anirudh Khatry, Robert Zhang, Jia Pan, Ziteng Wang, Qiaochu Chen, Greg Durrett, Isil Dillig

C-to-Rust transpilation is essential for modernizing legacy C code while enhancing safety and interoperability with modern Rust ecosystems. However, no dataset currently exists for evaluating whether a system can transpile C into safe Rust that passes a set of test cases. We introduce CRUST-Bench, a dataset of 100 C repositories, each paired with manually-written interfaces in safe Rust as well as test cases that can be used to validate correctness of the transpilation. By considering entire repositories rather than isolated functions, CRUST-Bench captures the challenges of translating complex projects with dependencies across multiple files. The provided Rust interfaces provide explicit specifications that ensure adherence to idiomatic, memory-safe Rust patterns, while the accompanying test cases enforce functional correctness. We evaluate state-of-the-art large language models (LLMs) on this task and find that safe and idiomatic Rust generation is still a challenging problem for various state-of-the-art methods and techniques. We also provide insights into the errors LLMs usually make in transpiling code from C to safe Rust. The best performing model, OpenAI o1, is able to solve only 15 tasks in a single-shot setting. Improvements on CRUST-Bench would lead to improved transpilation systems that can reason about complex scenarios and help in migrating legacy codebases from C into languages like Rust that ensure memory safety. You can find the dataset and code at https://github.com/anirudhkhatry/CRUST-bench.

C到Rust的转译对于现代化遗留C代码至关重要，同时还可以提高与现代Rust生态系统的安全性和互操作性。然而，目前尚不存在用于评估系统是否能将C转译为安全Rust并能通过一组测试用例的数据集。我们引入了CRUST-Bench数据集，其中包含100个C代码库，每个库都配有人工编写的安全Rust接口以及可用于验证转译正确性的测试用例。通过考虑整个代码库而不是孤立的函数，CRUST-Bench捕捉到了翻译具有跨多个文件依赖关系的复杂项目所面临的挑战。所提供的Rust接口提供了明确的规范，确保遵循惯用且内存安全的Rust模式，而随附的测试用例则强制执行功能正确性。我们在此任务上评估了最新的大型语言模型（LLM），发现安全和惯用Rust的生成对于各种最新方法和技术来说仍然是一个具有挑战性的问题。我们还深入探讨了LLM在将C代码转译为安全Rust时通常出现的错误。表现最佳的模型OpenAI o1在单回合设置中仅能完成15项任务。在CRUST-Bench上的改进将导致更先进的转译系统，能够处理复杂场景并帮助将遗留代码库从C迁移到像Rust这样的确保内存安全的编程语言。您可以在https://github.com/anirudhkhatry/CRUST-bench找到数据集和代码。

论文及项目相关链接

PDF

摘要
C转Rust的转换是现代化遗留C代码的关键，能提高安全性和与Rust生态系统的互操作性。然而，当前缺乏评估系统能否将C安全转换为Rust的数据库集。我们推出CRUST-Bench数据集，包含100个C代码库，每个库都配备有手动编写的安全Rust接口和测试用例，用于验证转换的正确性。该数据集考虑整个代码库而非孤立函数，能反映翻译具有跨多个文件依赖的复杂项目的挑战。提供的Rust接口提供明确的规范，确保遵循习惯性和内存安全的Rust模式，而附带测试案例则强制实行功能正确性。我们评估了最新的大型语言模型（LLM）在这一任务上的表现，发现生成安全且习惯性的Rust代码仍然是各种最新方法和技术的挑战性问题。我们还提供了关于LLM在将代码从C转安全Rust时常见错误的见解。表现最佳的模型OpenAI o1仅能在单轮任务中解决15个任务。对CRUST-Bench的改进将导致能处理复杂场景的改进转换系统，有助于将遗留代码库从C迁移到确保内存安全的Rust等语言。

关键见解

C转Rust转换是现代化遗留C代码的关键过程，有助于提高安全性和与Rust生态系统的互操作性。
当前缺乏评估C转Rust转换系统性能的数据库集。
CRUST-Bench数据集包含100个C代码库，每个都配备有手动编写的安全Rust接口和测试用例，用于验证转换的正确性。
CRUST-Bench数据集能反映翻译具有跨多个文件依赖的复杂项目的挑战。
最新大型语言模型（LLM）在生成安全且习惯性的Rust代码方面存在挑战。
LLM在将代码从C转安全Rust时常见错误类型包括语法错误、语义误解和内存安全问题。

Cool Papers

点此查看论文截图

Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators

Authors:Yilun Zhou, Austin Xu, Peifeng Wang, Caiming Xiong, Shafiq Joty

Scaling test-time computation, or affording a generator large language model (LLM) extra compute during inference, typically employs the help of external non-generative evaluators (i.e., reward models). Concurrently, LLM-judges, models trained to generate evaluations and critiques (explanations) in natural language, are becoming increasingly popular in automatic evaluation. Despite judge empirical successes, their effectiveness as evaluators in test-time scaling settings is largely unknown. In this paper, we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement. We evaluate 10 different judge models (7B-70B parameters) for 8 different base generator models (6.7B-72B parameters). Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures. Furthermore, though unique to LLM-judges, their natural language critiques are currently ineffective in guiding the generator towards better responses.

在测试时扩大计算规模或为大型语言模型（LLM）生成器在推理过程中提供额外的计算支持，通常借助外部非生成评估器（即奖励模型）的帮助。同时，训练以自然语言生成评估与批评（解释）的LLM判断模型在自动评估中越来越受欢迎。尽管判断模型在实际应用中有成功的经验，但作为测试时间规模设置中的评估者的有效性仍知之甚少。在本文中，我们介绍了测试时间规模判断（JETTS）基准测试，该测试在三个领域（数学推理、代码生成和指令遵循）下，对三种任务设置中的判断性能进行评价：响应重新排序、步骤级光束搜索和基于批评的响应优化。我们对8个不同基础生成模型的10个不同判断模型（参数从7B到70B）进行了评估。我们的基准测试表明，虽然在重新排序中，判断模型与结果奖励模型具有竞争力，但在光束搜索过程中，它们始终不如过程奖励模型。此外，虽然LLM判断具有独特性，但其自然语言批评目前在引导生成器生成更好响应方面效果不佳。

论文及项目相关链接

PDF The first two authors contributed equally. The codebase is at https://github.com/SalesforceAIResearch/jetts-benchmark

Summary

该文介绍了测试时间尺度上法官评价（LLM-judge）的表现。文中建立了一个名为JETTS的基准测试平台，以评估法官在三个领域（数学推理、代码生成和指令遵循）下的三种任务设置（响应重排、步骤级束搜索和基于批评的响应优化）中的表现。该研究评价了不同规模的法官模型和基础生成器模型的性能，发现法官在重排任务中与结果奖励模型竞争，但在束搜索过程中始终不如过程奖励模型。此外，尽管LLM法官的自然语言批评是独特的，但目前尚不能有效地指导生成器生成更好的响应。

Key Takeaways

测试时间尺度上引入LLM-judge评价，通过JETTS基准测试平台评估其在三个领域的表现。
法官模型在响应重排任务中与结果奖励模型表现相当。
在束搜索过程中，法官模型的表现始终不如过程奖励模型。
LLM-judge的自然语言批评在引导生成器生成更好响应方面目前尚不有效。
研究涉及多种规模的法官模型和基础生成器模型的性能评价。
JETTS基准测试平台可用于评估法官在不同任务设置和领域中的表现。

Cool Papers

点此查看论文截图

MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning

Authors:Yahan Yang, Soham Dan, Shuo Li, Dan Roth, Insup Lee

Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking, which can elicit harmful or unsafe behaviors. This vulnerability is exacerbated in multilingual setting, where multilingual safety-aligned data are often limited. Thus, developing a guardrail capable of detecting and filtering unsafe content across diverse languages is critical for deploying LLMs in real-world applications. In this work, we propose an approach to build a multilingual guardrail with reasoning. Our method consists of: (1) synthetic multilingual data generation incorporating culturally and linguistically nuanced variants, (2) supervised fine-tuning, and (3) a curriculum-guided Group Relative Policy Optimization (GRPO) framework that further improves performance. Experimental results demonstrate that our multilingual guardrail consistently outperforms recent baselines across both in-domain and out-of-domain languages. The multilingual reasoning capability of our guardrail enables it to generate multilingual explanations, which are particularly useful for understanding language-specific risks and ambiguities in multilingual content moderation.

大型语言模型（LLM）容易受到如越狱等对抗性攻击的影响，从而引发有害或不安全的行为。这种脆弱性在多语言环境中尤为加剧，因为多语言安全对齐的数据通常有限。因此，开发一种能够在多种语言中检测和过滤不安全内容的护栏，对于在现实世界应用中部署LLM至关重要。在这项工作中，我们提出了一种结合推理的多语言护栏构建方法。我们的方法包括：（1）合成多语言数据生成，融入文化和语言上的细微差异变体，（2）监督微调，以及（3）课程引导式群组相对策略优化（GRPO）框架，进一步提高性能。实验结果表明，我们的多语言护栏在域内和域外语言上均持续超越近期基线。我们的护栏具备多语言推理能力，能够生成多语言解释，对于理解多语言内容审核中的语言特定风险和模糊性特别有用。

论文及项目相关链接

PDF

Summary

大型语言模型（LLMs）易受到如越狱攻击等敌对攻击的影响，可能引发有害或不安全的行为。在多语言环境中，这种脆弱性因缺乏多语言安全对齐数据而加剧。因此，开发一种能够检测并过滤多种语言不安全内容的防护栏，对于在现实世界应用中部署LLMs至关重要。本研究提出了一种结合推理的多语言防护栏构建方法。该方法包括：（1）合成包含文化和语言细微变化的多语言数据生成，（2）监督微调，以及（3）课程引导式群组相对策略优化（GRPO）框架，进一步提高性能。实验结果表明，我们的多语言防护栏在域内和域外语言上均一致优于最近的基线。防护栏的多语言推理能力能够生成多语言解释，对于理解多语言内容管理中的语言特定风险和模糊性特别有用。

Key Takeaways

LLMs在面临越狱攻击时存在安全隐患，可能产生有害行为。
多语言环境中LLMs的脆弱性因缺乏多语言安全数据而加剧。
开发能检测并过滤多语言不安全内容的防护栏对LLM的现实应用至关重要。
提出了一种多语言防护栏构建方法，包括数据生成、监督微调和GRPO框架。
多语言防护栏在域内和域外语言上的性能均优于基线。
防护栏具备多语言推理能力，能生成多语言解释。

Cool Papers

点此查看论文截图

Integrating Symbolic Execution into the Fine-Tuning of Code-Generating LLMs

Authors:Marina Sakharova, Abhinav Anand, Mira Mezini

Code-generating Large Language Models (LLMs) have become essential tools in modern software development, enhancing productivity and accelerating development. This paper aims to investigate the fine-tuning of code-generating LLMs using Reinforcement Learning and Direct Preference Optimization, further improving their performance. To achieve this, we enhance the training data for the reward model with the help of symbolic execution techniques, ensuring more comprehensive and objective data. With symbolic execution, we create a custom dataset that better captures the nuances in code evaluation. Our reward models, fine-tuned on this dataset, demonstrate significant improvements over the baseline, CodeRL, in estimating the quality of generated code. Our code-generating LLMs, trained with the help of reward model feedback, achieve similar results compared to the CodeRL benchmark.

代码生成大型语言模型（LLM）已成为现代软件开发中的必备工具，提高了生产力并加速了开发进程。本文旨在研究使用强化学习和直接偏好优化对代码生成LLM进行微调，以进一步提高其性能。为此，我们借助符号执行技术增强奖励模型的训练数据，确保更全面、客观的数据。通过符号执行，我们创建了一个自定义数据集，更好地捕捉代码评估中的细微差别。在此数据集上微调的奖励模型在估计生成代码的质量方面相比基线CodeRL显示出显着改进。借助奖励模型反馈进行训练的代码生成LLM达到了与CodeRL基准相似的结果。

论文及项目相关链接

PDF

Summary

代码生成大型语言模型（LLM）在现代软件开发中已成为必不可少的工具，可以提高生产力和加快开发速度。本文旨在研究使用强化学习和直接偏好优化对代码生成LLM进行微调，以进一步提高其性能。我们通过符号执行技术增强奖励模型的训练数据，确保更全面、客观的数据。符号执行帮助我们创建一个自定义数据集，更好地捕捉代码评估的细微差别。在自定义数据集上微调奖励模型，在估计生成代码的质量方面，相比基线CodeRL，显示出显著改进。借助奖励模型反馈训练的代码生成LLM，达到了与CodeRL基准相似的结果。

Key Takeaways

代码生成LLM在现代软件开发中的重要性：提高生产力和加速开发。
本文研究使用强化学习和直接偏好优化对代码生成LLM进行微调。
符号执行技术用于增强奖励模型的训练数据。
符号执行创建自定义数据集，更准确地捕捉代码评估的细微差别。
奖励模型在估计生成代码质量方面相比CodeRL有显著改善。
通过奖励模型反馈训练的代码生成LLM达到与CodeRL基准相似的结果。

Cool Papers

点此查看论文截图

Compute-Optimal LLMs Provably Generalize Better With Scale

Authors:Marc Finzi, Sanyam Kapoor, Diego Granziol, Anming Gu, Christopher De Sa, J. Zico Kolter, Andrew Gordon Wilson

Why do larger language models generalize better? To investigate this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime, as described by the Chinchilla scaling laws. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. This generalization bound can be decomposed into three interpretable components: the number of parameters per token, the loss variance, and the quantization error at a fixed bitrate. As compute-optimal language models are scaled up, the number of parameters per data point remains constant; however, both the loss variance and the quantization error decrease, implying that larger models should have smaller generalization gaps. We examine why larger models tend to be more quantizable from an information theoretic perspective, showing that the rate at which they can integrate new information grows more slowly than their capacity on the compute-optimal frontier. From these findings we produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.

为什么更大的语言模型能够更好地泛化？为了研究这个问题，我们在计算最优状态下对大型语言模型（LLM）的预训练目标制定了泛化边界，如金雀花扩展定律所述。我们引入了一种新颖的、完全基于经验的Freedman型马尔可夫浓度不等式，它通过考虑损失函数的方差来收紧现有边界。这个泛化边界可以分解为三个可解释的成分：每标记的参数数量、损失方差和在固定比特率下的量化误差。在计算最优语言模型时，每数据点的参数数量保持不变；然而，损失方差和量化误差都会减少，这意味着更大的模型应该具有较小的泛化差距。我们从信息理论的角度分析为什么更大的模型更容易量化，显示它们整合新信息的速率在最优计算边界上的增长比其容量要慢。根据这些发现，我们为泛化差距制定了一个扩展定律，随着规模的扩大，这些边界变得更加可预测和牢固。

论文及项目相关链接

PDF ICLR 2025

Summary

大型语言模型的泛化能力为何更强？本文通过对大型语言模型（LLM）的预训练目标进行泛化边界研究，探讨了这一问题。在Chinchilla规模定律描述的计算机最优状态下，我们引入了一种全新的、完全实证的Freedman型马尔可夫浓度不等式，该不等式通过考虑损失函数的方差来缩小现有边界。这个泛化边界可以分解为三个可解释的成分：每令牌的参数数量、损失方差和在固定比特率下的量化误差。随着计算机最优语言模型的扩展，每数据点的参数数量保持不变，但损失方差和量化误差均减少，这意味着更大的模型应具有较小的泛化误差。我们从信息理论的角度分析为何更大的模型更容易量化，并展示了它们整合新信息的速度与计算最优边界上的容量之间的关系。基于这些发现，我们为泛化误差提出了一个规模定律，随着规模的扩大，边界变得更具预测性。

Key Takeaways

大型语言模型的泛化能力得益于对预训练目标的泛化边界研究。
引入了Freedman型马尔可夫浓度不等式来改进现有边界。
泛化边界包括每令牌的参数数量、损失方差和量化误差三个可解释的成分。
随着语言模型的扩展，损失方差和量化误差减少。
大型模型更容易量化，从信息理论角度分析了其原因。
大型语言模型整合新信息的速度与容量之间的关系被揭示。

Cool Papers

点此查看论文截图

Synergistic Weak-Strong Collaboration by Aligning Preferences

Authors:Yizhu Jiao, Xuchao Zhang, Zhaoyang Wang, Yubo Ma, Zhun Deng, Rujia Wang, Chetan Bansal, Saravan Rajmohan, Jiawei Han, Huaxiu Yao

Current Large Language Models (LLMs) excel in general reasoning yet struggle with specialized tasks requiring proprietary or domain-specific knowledge. Fine-tuning large models for every niche application is often infeasible due to black-box constraints and high computational overhead. To address this, we propose a collaborative framework that pairs a specialized weak model with a general strong model. The weak model, tailored to specific domains, produces initial drafts and background information, while the strong model leverages its advanced reasoning to refine these drafts, extending LLMs’ capabilities to critical yet specialized tasks. To optimize this collaboration, we introduce a collaborative feedback to fine-tunes the weak model, which quantifies the influence of the weak model’s contributions in the collaboration procedure and establishes preference pairs to guide preference tuning of the weak model. We validate our framework through experiments on three domains. We find that the collaboration significantly outperforms each model alone by leveraging complementary strengths. Moreover, aligning the weak model with the collaborative preference further enhances overall performance.

当前的大型语言模型（LLM）在一般推理方面表现出色，但在需要专业知识或特定领域知识的专业化任务方面却遇到了困难。由于黑箱约束和计算开销高昂，为每一个专业应用微调大型模型通常并不可行。为了解决这一问题，我们提出了一种协作框架，该框架将一个专业化的弱模型与一个通用的强模型配对。弱模型针对特定领域进行定制，用于生成初步草案和背景信息，而强模型则利用其先进的推理能力对这些草案进行改进，从而将LLM的能力扩展到关键但专业化的任务。为了优化这种协作，我们引入了一种协作反馈来微调弱模型，该反馈量化弱模型在协作过程中的贡献，并建立偏好配对来指导弱模型的偏好调整。我们在三个领域进行实验验证我们的框架。我们发现，通过利用各自的互补优势，这种协作方式显著优于单独使用任一模型。此外，使弱模型与协作偏好保持一致，可进一步提高整体性能。

论文及项目相关链接

PDF

Summary
大型语言模型（LLM）在通用推理方面表现出色，但在需要专业知识或特定领域知识的专业领域任务方面存在挑战。由于黑箱约束和计算开销的限制，为每一个专业领域应用微调大型模型通常并不可行。为解决此问题，我们提出了一种协作框架，将专业领域的弱模型与通用领域的强模型配对。弱模型负责生成初步草案和背景信息，而强模型则利用其高级推理能力进行改进，从而扩展LLM在关键专业领域任务上的能力。为优化这一协作过程，我们引入了协作反馈来微调弱模型，量化弱模型在协作过程中的贡献，并建立偏好配对以指导弱模型的偏好调整。通过实验验证，该协作框架在三个领域中的表现均显著优于单一模型。进一步与协作偏好对齐的弱模型，更提高了整体性能。

Key Takeaways

大型语言模型（LLM）在通用推理方面表现出色，但在专业领域任务上遇到困难。
提出一种协作框架，结合弱模型和强模型的优点来解决这一问题。
弱模型针对特定领域生成初步草案和背景信息。
强模型利用高级推理能力对弱模型生成的初步成果进行改进。
引入协作反馈机制来微调弱模型，提高其性能。
量化弱模型在协作中的贡献，并建立偏好配对以优化协作过程。

Cool Papers

点此查看论文截图

DSPO: Direct Semantic Preference Optimization for Real-World Image Super-Resolution

Authors:Miaomiao Cai, Simiao Li, Wei Li, Xudong Huang, Hanting Chen, Jie Hu, Yunhe Wang

Recent advances in diffusion models have improved Real-World Image Super-Resolution (Real-ISR), but existing methods lack human feedback integration, risking misalignment with human preference and may leading to artifacts, hallucinations and harmful content generation. To this end, we are the first to introduce human preference alignment into Real-ISR, a technique that has been successfully applied in Large Language Models and Text-to-Image tasks to effectively enhance the alignment of generated outputs with human preferences. Specifically, we introduce Direct Preference Optimization (DPO) into Real-ISR to achieve alignment, where DPO serves as a general alignment technique that directly learns from the human preference dataset. Nevertheless, unlike high-level tasks, the pixel-level reconstruction objectives of Real-ISR are difficult to reconcile with the image-level preferences of DPO, which can lead to the DPO being overly sensitive to local anomalies, leading to reduced generation quality. To resolve this dichotomy, we propose Direct Semantic Preference Optimization (DSPO) to align instance-level human preferences by incorporating semantic guidance, which is through two strategies: (a) semantic instance alignment strategy, implementing instance-level alignment to ensure fine-grained perceptual consistency, and (b) user description feedback strategy, mitigating hallucinations through semantic textual feedback on instance-level images. As a plug-and-play solution, DSPO proves highly effective in both one-step and multi-step SR frameworks.

最新的扩散模型进展已经提高了真实世界图像超分辨率（Real-ISR）的效果，但现有方法缺乏人类反馈整合，存在与人类偏好不一致的风险，并可能导致出现伪影、幻觉和有害内容生成。为此，我们首次将人类偏好对齐技术引入Real-ISR。该技术已成功应用于大型语言模型和文本到图像任务，可有效提高生成输出与人类偏好的对齐程度。具体来说，我们将直接偏好优化（DPO）引入Real-ISR以实现对齐，其中DPO作为一种通用的对齐技术，直接从人类偏好数据集中学习。然而，不同于高级任务，Real-ISR的像素级重建目标与DPO的图像级偏好很难协调，这可能导致DPO对局部异常过于敏感，从而降低生成质量。为了解决这一矛盾，我们提出直接语义偏好优化（DSPO），通过融入语义指导来对齐实例级人类偏好，这通过两个策略来实现：（a）语义实例对齐策略，实现实例级对齐以确保精细的感知一致性；（b）用户描述反馈策略，通过实例级图像的语义文本反馈来缓解幻觉。作为一种即插即用的解决方案，DSPO在单步和多步SR框架中都证明了其高度有效性。

论文及项目相关链接

PDF

Summary
最新扩散模型改进了真实世界图像超分辨率技术（Real-ISR），但现有方法未融入人类反馈，可能导致与人为偏好的错位以及图像生成时产生的伪影、幻象和有害内容。为此，我们首次将人类偏好对齐技术引入Real-ISR，该技术已成功应用于大型语言模型和文本到图像任务，能有效提升生成输出与人为偏好的对齐度。我们引入直接偏好优化（DPO）实现对齐，它是一种通用对齐技术，可直接从人类偏好数据集中学习。然而，不同于高级任务，Real-ISR的像素级重建目标与DPO的图像级偏好难以协调，可能导致DPO过于敏感于局部异常，降低生成质量。为解决这一矛盾，我们提出直接语义偏好优化（DSPO），通过融入语义指导来对齐实例级人为偏好，包括两个策略：一是语义实例对齐策略，实现实例级对齐以确保精细粒度感知一致性；二是用户描述反馈策略，通过实例级图像的语义文本反馈来减少幻象。DSPO作为一种即插即用解决方案，在单步和多步SR框架中都表现出高度有效性。

Key Takeaways

扩散模型的最新进展已改进了Real-ISR技术，但缺乏人类反馈集成可能导致与人为偏好的错位及生成问题。
人类偏好对齐技术首次被引入Real-ISR，提升生成输出与人为偏好的对齐度。
直接偏好优化（DPO）是一种通用对齐技术，可从人类偏好数据集中学习。
Real-ISR的像素级重建目标与DPO的图像级偏好存在协调困难，可能导致DPO过于敏感。
直接语义偏好优化（DSPO）通过融入语义指导解决上述问题，实现对实例级的偏好对齐。
DSPO包含两个策略：语义实例对齐和用户描述反馈策略，分别确保感知一致性和减少幻象。

Cool Papers

点此查看论文截图

The Synthetic Imputation Approach: Generating Optimal Synthetic Texts For Underrepresented Categories In Supervised Classification Tasks

Authors:Joan C. Timoneda

Encoder-decoder Large Language Models (LLMs), such as BERT and RoBERTa, require that all categories in an annotation task be sufficiently represented in the training data for optimal performance. However, it is often difficult to find sufficient examples for all categories in a task when building a high-quality training set. In this article, I describe this problem and propose a solution, the synthetic imputation approach. Leveraging a generative LLM (GPT-4o), this approach generates synthetic texts based on careful prompting and five original examples drawn randomly with replacement from the sample. This approach ensures that new synthetic texts are sufficiently different from the original texts to reduce overfitting, but retain the underlying substantive meaning of the examples to maximize out-of-sample performance. With 75 original examples or more, synthetic imputation’s performance is on par with a full sample of original texts, and overfitting remains low, predictable and correctable with 50 original samples. The synthetic imputation approach provides a novel role for generative LLMs in research and allows applied researchers to balance their datasets for best performance.

编码器-解码器大型语言模型（LLM），如BERT和RoBERTa，要求在标注任务的类别在训练数据中能够得到充分的表示，以实现最佳性能。然而，在构建高质量训练集时，通常很难为任务中的所有类别找到足够的示例。在本文中，我描述了这个问题并提出了一个解决方案，即合成插补方法。该方法利用生成式LLM（GPT-4o）生成基于精心提示的合成文本，并使用五个随机抽取（允许重复）的原始示例。这种方法确保新生成的合成文本与原始文本足够不同，以减少过度拟合，同时保留示例的基本实际意义，以最大化样本外的性能。使用75个原始示例或更多时，合成插补方法的性能与完整样本的原始文本相当，过度拟合仍然保持低水平、可预测并可借助原始样本进行修正。合成插补方法为生成式LLM的研究提供了一个新的角色，并允许应用研究人员为了获得最佳性能而平衡他们的数据集。

论文及项目相关链接

PDF

Summary

本文主要介绍了在大规模语言模型（LLM）的标注任务中遇到的类别数据不充分的问题。作者提出了合成填充方法来解决这个问题，利用生成式LLM（如GPT-4o）基于精心设计的提示和五个原始样本生成合成文本。这种方法确保了新的合成文本与原始文本足够不同，以减少过度拟合，同时保留样本的实质性意义，以最大化样本外的性能。使用75个以上的原始样本时，合成填充的性能与全样本的原始文本相当，并且过度拟合保持在低水平且可预测和可纠正。合成填充方法为生成式LLM在研究中提供了新的作用，并允许应用研究人员平衡他们的数据集以获得最佳性能。

Key Takeaways

大规模语言模型（LLM）在标注任务中需要充分代表所有类别以达到最佳性能。
当构建高质量训练集时，为所有类别找到足够的例子通常很困难。
合成填充方法通过使用生成式LLM（如GPT-4o）来解决这个问题。
该方法基于精心设计的提示和五个随机选择的原始样本生成合成文本。
合成文本与原始文本足够不同，以减少过度拟合，同时保留样本的实质性意义。
使用75个以上的原始样本时，合成填充的性能与全样本的原始文本相当。

Cool Papers

点此查看论文截图

KGMEL: Knowledge Graph-Enhanced Multimodal Entity Linking

Authors:Juyeon Kim, Geon Lee, Taeuk Kim, Kijung Shin

Entity linking (EL) aligns textual mentions with their corresponding entities in a knowledge base, facilitating various applications such as semantic search and question answering. Recent advances in multimodal entity linking (MEL) have shown that combining text and images can reduce ambiguity and improve alignment accuracy. However, most existing MEL methods overlook the rich structural information available in the form of knowledge-graph (KG) triples. In this paper, we propose KGMEL, a novel framework that leverages KG triples to enhance MEL. Specifically, it operates in three stages: (1) Generation: Produces high-quality triples for each mention by employing vision-language models based on its text and images. (2) Retrieval: Learns joint mention-entity representations, via contrastive learning, that integrate text, images, and (generated or KG) triples to retrieve candidate entities for each mention. (3) Reranking: Refines the KG triples of the candidate entities and employs large language models to identify the best-matching entity for the mention. Extensive experiments on benchmark datasets demonstrate that KGMEL outperforms existing methods. Our code and datasets are available at: https://github.com/juyeonnn/KGMEL.

实体链接（EL）将文本提及与知识库中的相应实体对齐，促进了语义搜索和问答等各种应用。多模态实体链接（MEL）的最新进展表明，结合文本和图像可以减少歧义，提高对齐准确性。然而，大多数现有的MEL方法忽视了知识图谱（KG）三元组中丰富的结构信息。在本文中，我们提出了KGMEL，这是一个利用KG三元组增强MEL的新型框架。具体来说，它分为三个阶段：（1）生成：通过基于文本和图像的视觉语言模型生成高质量的三元组。（2）检索：通过对比学习学习联合提及实体表示，将文本、图像和（生成或知识图谱）三元组合并，以检索每个提及的候选实体。（3）重新排序：精炼候选实体的KG三元组，并利用大型语言模型识别与提及内容最佳匹配的实体。在基准数据集上的广泛实验表明，KGMEL优于现有方法。我们的代码和数据集可通过以下网址获取：https://github.com/juyeonnn/KGMEL。

论文及项目相关链接

PDF SIGIR 2025 (Short)

Summary
文本介绍了实体链接（EL）和多模态实体链接（MEL）的基本概念及其在知识图谱（KG）中的应用。然而，现有的MEL方法忽略了知识图谱中的丰富结构信息。本文提出了一种新的框架KGMEL，利用KG三元组来增强MEL性能。KGMEL包含三个主要阶段：生成高质量的三元组，学习联合提及实体表示以及重新排序候选实体以识别最佳匹配实体。实验证明，KGMEL在基准数据集上的表现优于现有方法。

Key Takeaways

实体链接（EL）是将文本中的提及与知识库中的对应实体进行对齐的技术，多模态实体链接（MEL）结合了文本和图像来减少歧义并提高对齐准确性。
现有MEL方法忽视了知识图谱（KG）中的结构信息。
KGMEL是一个新的框架，利用KG三元组来增强MEL性能。
KGMEL包含三个阶段：生成高质量三元组、学习联合提及实体表示和重新排序候选实体。
KGMEL通过视觉语言模型和对比学习技术生成和检索高质量的三元组。
在基准数据集上的实验证明，KGMEL的性能优于现有的MEL方法。

Cool Papers

点此查看论文截图

EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models

Authors:Ziwen Xu, Shuxun Wang, Kewei Xu, Haoming Xu, Mengru Wang, Xinle Deng, Yunzhi Yao, Guozhou Zheng, Huajun Chen, Ningyu Zhang

In this paper, we introduce EasyEdit2, a framework designed to enable plug-and-play adjustability for controlling Large Language Model (LLM) behaviors. EasyEdit2 supports a wide range of test-time interventions, including safety, sentiment, personality, reasoning patterns, factuality, and language features. Unlike its predecessor, EasyEdit2 features a new architecture specifically designed for seamless model steering. It comprises key modules such as the steering vector generator and the steering vector applier, which enable automatic generation and application of steering vectors to influence the model’s behavior without modifying its parameters. One of the main advantages of EasyEdit2 is its ease of use-users do not need extensive technical knowledge. With just a single example, they can effectively guide and adjust the model’s responses, making precise control both accessible and efficient. Empirically, we report model steering performance across different LLMs, demonstrating the effectiveness of these techniques. We have released the source code on GitHub at https://github.com/zjunlp/EasyEdit along with a demonstration notebook. In addition, we provide a demo video at https://zjunlp.github.io/project/EasyEdit2/video for a quick introduction.

本文介绍了EasyEdit2，这是一个旨在实现对大型语言模型（LLM）行为进行即插即用调整的框架。EasyEdit2支持广泛的测试时间干预，包括安全、情感、个性、推理模式、真实性和语言特征。不同于其前身，EasyEdit2采用专门设计用于无缝模型引导的新架构。它包含关键模块，如引导向量生成器和引导向量应用器，能够自动生成和应用引导向量，以影响模型的行为，而无需修改其参数。EasyEdit2的主要优点之一是使用方便-用户无需具备广泛的技术知识。只需一个示例，他们就可以有效地引导和调整模型的响应，使精确控制变得既方便又高效。我们通过实验报告了在不同LLM上的模型引导性能，证明了这些技术的有效性。我们已在GitHub上发布了源代码，网址为https://github.com/zjunlp/EasyEdit，并附带一个演示笔记本。此外，我们还提供了https://zjunlp.github.io/project/EasyEdit2/video演示视频，以供快速介绍。

Summary

EasyEdit2框架旨在实现即插即用可调整的大型语言模型（LLM）行为控制。它支持广泛的安全、情感、个性、推理模式、事实和语言特征的测试时间干预。新架构具备无缝模型转向功能，包含转向矢量生成器和转向矢量应用器等关键模块，可自动生成和应用转向矢量以影响模型行为，无需修改参数。EasyEdit2易于使用，用户无需深入了解技术细节，只需一个示例即可有效引导和调整模型响应。经验报告显示在不同LLM中模型转向性能效果显著。源码已发布在GitHub上，并附有演示笔记本和视频介绍。

Key Takeaways

EasyEdit2是一个用于控制大型语言模型行为的框架。
它支持多种测试时间干预，包括安全、情感、个性等。
新架构具有无缝模型转向功能。
含有转向矢量生成器和应用器等关键模块。
EasyEdit2能自动生成和应用转向矢量，无需修改模型参数。
该框架易于使用，用户只需简单示例即可调整模型响应。

Cool Papers

点此查看论文截图

Kuwain 1.5B: An Arabic SLM via Language Injection

Authors:Khalil Hennara, Sara Chrouf, Mohamed Motaism Hamed, Zeina Aldallal, Omar Hadid, Safwan AlModhayan

Enhancing existing models with new knowledge is a crucial aspect of AI development. This paper introduces a novel method for integrating a new language into a large language model (LLM). Our approach successfully incorporates a previously unseen target language into an existing LLM without compromising its prior knowledge. We trained a tiny model with 1.5 billion parameters named Kuwain by injecting the Arabic language into a small open-source model mainly trained in English. Our method demonstrates significant improvements in Arabic language performance, with an average 8% improvement across various benchmarks, while retaining the model’s existing knowledge with a minimum amount of the original model’s data. This offers a cost-effective alternative to training a comprehensive model in both English and Arabic. The results highlight the potential for efficient, targeted language model expansion without extensive retraining or resource-intensive processes.

在人工智能发展中，用新知识增强现有模型是一个至关重要的方面。本文介绍了一种将新语言集成到大型语言模型（LLM）中的新方法。我们的方法能够成功地将一种先前未见的目标语言（阿拉伯语言）并入到现有LLM中，而不会损害其原有知识。我们通过在主要为英语的小型开源模型中注入阿拉伯语，训练了一个名为“库瓦因”（Kuwain）的小型模型，该模型拥有1.5亿参数。我们的方法在阿拉伯语性能上取得了显著的提升，在各种基准测试中平均提高了8%，同时保留了模型原有的知识并使用了最小的原始模型数据量。这为在英语和阿拉伯语方面进行全面模型训练提供了经济高效的替代方案。结果突显了高效、有针对性的语言模型扩展的潜力，无需进行大规模重新训练或资源密集型过程。

论文及项目相关链接

PDF

Summary

本文介绍了一种将新语言融入大型语言模型（LLM）的新方法。该方法能够在不损失原有知识的前提下，成功将之前未见的目标语言融入现有LLM中。通过向主要英语训练的小型开源模型中注入阿拉伯语，训练出一个仅有1.5亿参数的微型模型——Kuwain。该方法在阿拉伯语表现上显著提高，在各种基准测试中平均提高8%，并且仅使用少量原始模型数据就保留住了模型的现有知识。这为在英语和阿拉伯语全面训练模型提供了一种成本效益高的替代方案。结果突显了高效、有针对性的语言模型扩展的潜力，无需大量重新训练或资源密集型过程。

Key Takeaways

论文提出了一种将新语言融入大型语言模型的新方法。
该方法能够在不损失原有知识的前提下融入新语言。
通过向主要英语训练的小型开源模型中注入阿拉伯语，创建了一个名为Kuwain的微型模型。
Kuwain模型在阿拉伯语表现上显著提高，平均提高8%。
该方法使用少量原始模型数据就实现了知识的保留。
这为在英语和阿拉伯语全面训练模型提供了成本效益高的替代方案。

Cool Papers

点此查看论文截图

Empowering AI to Generate Better AI Code: Guided Generation of Deep Learning Projects with LLMs

Authors:Chen Xie, Mingsheng Jiao, Xiaodong Gu, Beijun Shen

While large language models (LLMs) have been widely applied to code generation, they struggle with generating entire deep learning projects, which are characterized by complex structures, longer functions, and stronger reliance on domain knowledge than general-purpose code. An open-domain LLM often lacks coherent contextual guidance and domain expertise for specific projects, making it challenging to produce complete code that fully meets user requirements. In this paper, we propose a novel planning-guided code generation method, DLCodeGen, tailored for generating deep learning projects. DLCodeGen predicts a structured solution plan, offering global guidance for LLMs to generate the project. The generated plan is then leveraged to retrieve semantically analogous code samples and subsequently abstract a code template. To effectively integrate these multiple retrieval-augmented techniques, a comparative learning mechanism is designed to generate the final code. We validate the effectiveness of our approach on a dataset we build for deep learning code generation. Experimental results demonstrate that DLCodeGen outperforms other baselines, achieving improvements of 9.7% in CodeBLEU and 3.6% in human evaluation metrics.

虽然大型语言模型（LLM）已经广泛应用于代码生成，但它们在生成整个深度学习项目时面临挑战。深度学习项目具有复杂结构、更长的功能和比通用代码更强烈的领域知识依赖。开放领域的LLM通常缺乏连贯的上下文指导和特定项目的专业知识，因此难以生成完全满足用户需求的完整代码。在本文中，我们提出了一种新型的规划引导代码生成方法DLCodeGen，专门用于生成深度学习项目。DLCodeGen预测结构化解决方案计划，为LLM生成项目提供全局指导。生成的计划随后被用来检索语义相似的代码样本，然后抽象出代码模板。为了有效地整合这些多重检索增强技术，设计了一种对比学习机制来生成最终代码。我们在为深度学习代码生成构建的数据集上验证了我们的方法的有效性。实验结果表明，DLCodeGen优于其他基线方法，在CodeBLEU指标上提高了9.7%，在人类评价指标上提高了3.6%。

论文及项目相关链接

PDF

Summary

基于大型语言模型（LLM）在代码生成中的广泛应用，针对生成整个深度学习项目所面临的挑战，本文提出了一种新型规划引导的代码生成方法DLCodeGen。DLCodeGen能够预测结构化解决方案计划，为LLM生成项目提供全局指导。通过利用生成的计划，检索语义相似的代码样本并抽象出代码模板。为了有效整合多种检索增强技术，设计了一种对比学习机制以生成最终代码。在自建的深度学习代码生成数据集上进行验证，实验结果表明DLCodeGen优于其他基线方法，在CodeBLEU指标上提高了9.7%，在人类评估指标上提高了3.6%。

Key Takeaways

大型语言模型（LLM）在代码生成方面已得到广泛应用，但在生成深度学习项目时面临挑战，因为这些项目具有复杂结构、更长的功能和更强的领域知识依赖。
DLCodeGen是一种针对深度学习项目的规划引导代码生成方法，能够预测结构化解决方案计划，为LLM生成项目提供全局指导。
DLCodeGen通过利用生成的计划，结合检索增强技术，从语义相似的代码样本中抽象出代码模板。
为了整合多种检索增强技术，DLCodeGen设计了一种对比学习机制。
实验结果表明，DLCodeGen在深度学习代码生成方面优于其他基线方法。
DLCodeGen在CodeBLEU指标上提高了9.7%，显示出其在代码生成方面的有效性。

Cool Papers

点此查看论文截图

ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data

Authors:Tong Chen, Faeze Brahman, Jiacheng Liu, Niloofar Mireshghallah, Weijia Shi, Pang Wei Koh, Luke Zettlemoyer, Hannaneh Hajishirzi

Language models (LMs) can memorize and reproduce segments from their pretraining data verbatim even in non-adversarial settings, raising concerns about copyright, plagiarism, privacy, and creativity. We introduce Paraphrase Preference Optimization (ParaPO), a post-training method that fine-tunes LMs to reduce unintentional regurgitation while preserving their overall utility. ParaPO trains LMs to prefer paraphrased versions of memorized segments over the original verbatim content from the pretraining data. To maintain the ability to recall famous quotations when appropriate, we develop a variant of ParaPO that uses system prompts to control regurgitation behavior. In our evaluation on Llama3.1-8B, ParaPO consistently reduces regurgitation across all tested datasets (e.g., reducing the regurgitation metric from 17.3 to 12.9 in creative writing), whereas unlearning methods used in prior work to mitigate regurgitation are less effective outside their targeted unlearned domain (from 17.3 to 16.9). When applied to the instruction-tuned Tulu3-8B model, ParaPO with system prompting successfully preserves famous quotation recall while reducing unintentional regurgitation (from 8.7 to 6.3 in creative writing) when prompted not to regurgitate. In contrast, without ParaPO tuning, prompting the model not to regurgitate produces only a marginal reduction (8.7 to 8.4).

语言模型（LM）能够在非对抗性环境中逐字逐句地记忆和复制其预训练数据中的片段，这引发了人们对版权、剽窃、隐私和创造力的担忧。我们引入了Paraphrase Preference Optimization（ParaPO）这一后训练法，对语言模型进行微调，以减少无意识的复述现象，同时保留其整体效用。ParaPO训练语言模型优先选择对记忆片段进行同义改写，而非使用预训练数据中的原始内容。为了保持适时回忆著名引语的能力，我们开发了一种使用系统提示来控制复述行为的ParaPO变体。在对Llama3.1-8B的评估中，ParaPO在所有测试数据集上都能持续减少复述现象（例如，在创造性写作方面将复述指标从17.3降至12.9），而先前工作中用于缓解复述的去学习法在其未针对的领域外效果较差（从17.3降至16.9）。当应用于指令调优的Tulu3-8B模型时，带有系统提示的ParaPO成功保留了著名引语的回忆能力，同时减少了无意识的复述现象（在创造性写作方面从8.7降至6.3），当提示不要复述时效果更佳。相比之下，没有使用ParaPO调优，仅提示模型不要复述，效果仅略有降低（从8.7降至8.4）。

论文及项目相关链接

PDF

Summary

本文介绍了语言模型（LMs）即使在非对抗性环境中也能记忆并再现预训练数据中的片段，引发了关于版权、剽窃、隐私和创造力的担忧。为此，文章提出了一种名为Paraphrase Preference Optimization（ParaPO）的后训练法，旨在通过微调LM来减少无意识的复述，同时保持其整体效用。ParaPO训练LM更倾向于使用预训练数据中记忆片段的同义版本而非原样内容。为了能在适当时候回忆著名引文，文章还开发了带有系统提示功能的ParaPO变体。在Llama3.1-8B上的评估显示，ParaPO在所有测试数据集上都能有效减少复述（例如，在创造性写作中将复述指标从17.3降至12.9），而先前工作中用于减轻复述的遗忘方法在非目标遗忘领域之外的效果较差（从17.3降至16.9）。当应用于指令调整的Tulu3-8B模型时，带有系统提示的ParaPO成功保留了著名引文的回忆，同时减少了无意识的复述（在创造性写作中从8.7降至6.3），而不使用ParaPO提示模型减少复述只产生轻微的效果（从8.7降至8.4）。

Key Takeaways

语言模型（LMs）能记忆并再现预训练数据中的片段，引发关于版权、剽窃等问题的担忧。
提出Paraphrase Preference Optimization（ParaPO）后训练法，旨在减少LM的无意识复述，同时保持其整体效能。
ParaPO通过训练LM更倾向于使用同义版本而非原样内容来减少复述。
开发了带有系统提示功能的ParaPO变体，以在适当时候回忆著名引文。
在Llama3.1-8B上的评估显示，ParaPO能有效减少复述。
与先前的遗忘方法相比，ParaPO在减少复述方面表现出更好的效果。

Cool Papers

点此查看论文截图

Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models

Authors:Patrick Haller, Jonas Golde, Alan Akbik

Knowledge distillation is a widely used technique for compressing large language models (LLMs) by training a smaller student model to mimic a larger teacher model. Typically, both the teacher and student are Transformer-based architectures, leveraging softmax attention for sequence modeling. However, the quadratic complexity of self-attention at inference time remains a significant bottleneck, motivating the exploration of subquadratic alternatives such as structured state-space models (SSMs), linear attention, and recurrent architectures. In this work, we systematically evaluate the transferability of knowledge distillation from a Transformer teacher to nine subquadratic student architectures. Our study aims to determine which subquadratic model best aligns with the teacher’s learned representations and how different architectural constraints influence the distillation process. We also investigate the impact of intelligent initialization strategies, including matrix mixing and query-key-value (QKV) copying, on the adaptation process. Our empirical results on multiple NLP benchmarks provide insights into the trade-offs between efficiency and performance, highlighting key factors for successful knowledge transfer to subquadratic architectures.

知识蒸馏是一种广泛应用于压缩大型语言模型（LLM）的技术，通过训练较小的学生模型来模仿较大的教师模型。通常，教师和学生的架构都是基于Transformer的，利用softmax注意力进行序列建模。然而，推理时自注意力的二次复杂性仍然是显著瓶颈，这促使人们探索次二次替代方案，如结构化状态空间模型（SSMs）、线性注意力和循环架构。在这项工作中，我们系统地评估了从Transformer教师到九种子二次学生架构的知识蒸馏的迁移性。我们的研究旨在确定哪个子二次模型最能与教师的已学表示对齐，以及不同的架构约束如何影响蒸馏过程。我们还研究了智能初始化策略，包括矩阵混合和查询-键值（QKV）复制对适应过程的影响。我们在多个NLP基准测试上的实证结果为效率和性能之间的权衡提供了见解，并突出了成功转移知识到子二次架构的关键因素。

论文及项目相关链接

PDF

摘要

知识蒸馏是一种广泛用于压缩大型语言模型（LLM）的技术，通过训练小型的学生模型来模仿大型的教师模型。通常，教师和学生都是基于Transformer的架构，利用softmax注意力进行序列建模。然而，自注意力在推理时间上的二次复杂性仍是显著瓶颈，促使人们探索次二次替代方案，如结构化状态空间模型（SSM）、线性注意力和递归架构。在这项工作中，我们系统地评估了从Transformer教师到九种子二次学生架构的知识蒸馏的迁移性。我们的研究旨在确定哪个次二次模型最能与教师的表示对齐，以及不同的架构约束如何影响蒸馏过程。我们还研究了智能初始化策略，包括矩阵混合和查询-键-值（QKV）复制对适应过程的影响。我们在多个NLP基准测试上的实证结果为效率和性能之间的权衡提供了见解，并突出了成功将知识转移到次二次架构的关键因素。

关键见解

知识蒸馏是压缩大型语言模型的有效技术，通过训练小型模型来模仿大型模型。
目前存在推理时间自注意力的二次复杂性瓶颈，需要探索次二次模型替代方案。
系统评估了知识蒸馏从Transformer教师模型到多种次二次学生模型的迁移性。
研究了不同架构约束对蒸馏过程的影响。
智能初始化策略，如矩阵混合和QKV复制，对适应过程有重要影响。
实证结果揭示了效率和性能之间的权衡。

Cool Papers

点此查看论文截图

Using customized GPT to develop prompting proficiency in architectural AI-generated images

Authors:Juan David Salazar Rodriguez, Sam Conrad Joyce, Julfendi Julfendi

This research investigates the use of customized GPT models to enhance prompting proficiency among architecture students when generating AI-driven images. Prompt engineering is increasingly essential in architectural education due to the widespread adoption of generative AI tools. This study utilized a mixed-methods experimental design involving architecture students divided into three distinct groups: a control group receiving no structured support, a second group provided with structured prompting guides, and a third group supported by both structured guides and interactive AI personas. Students engaged in reverse engineering tasks, first guessing provided image prompts and then generating their own prompts, aiming to boost critical thinking and prompting skills. Variables examined included time spent prompting, word count, prompt similarity, and concreteness. Quantitative analysis involved correlation assessments between these variables and a one-way ANOVA to evaluate differences across groups. While several correlations showed meaningful relationships, not all were statistically significant. ANOVA results indicated statistically significant improvements in word count, similarity, and concreteness, especially in the group supported by AI personas and structured prompting guides. Qualitative feedback complemented these findings, revealing enhanced confidence and critical thinking skills in students. These results suggest tailored GPT interactions substantially improve students’ ability to communicate architectural concepts clearly and effectively.

本研究探讨了使用定制的GPT模型在提高建筑学生在生成AI驱动图像时的提示能力方面的应用。由于生成式AI工具的广泛应用，提示工程在建筑教育中变得越来越重要。本研究采用混合方法实验设计，将建筑学生分为三组：对照组无结构化支持，第二组提供结构化提示指南，第三组由结构化指南和交互式AI人格提供支持。学生们参与了逆向工程任务，首先猜测提供的图像提示，然后生成自己的提示，旨在提高批判思维和提示技能。研究的变量包括提示所花费的时间、字数、提示的相似性和具体性。定量分析包括这些变量之间的相关性评估，以及单因素方差分析，以评估各组之间的差异。虽然有几个相关性显示出有意义的关系，但并非所有关系在统计上都是显著的。方差分析结果表明，在字数、相似性和具体性方面有统计显著的改进，特别是那些受到AI人格和结构化提示指南支持的小组。定性反馈补充了这些发现，显示学生的自信和批判思维能力有所提高。这些结果表明，量身定制的GPT互动能显著提高学生清晰有效地传达建筑概念的能力。

论文及项目相关链接

PDF

Summary

本研究探讨了使用定制GPT模型在提高建筑专业学生生成AI驱动图像时的提示能力。随着生成式AI工具的广泛应用，提示工程在建筑设计教育中的重要性日益增加。该研究采用混合方法实验设计，将建筑专业学生分为三组：对照组无结构化支持，第二组提供结构化提示指南，第三组由结构化指南和交互式AI人格提供支持。学生们参与逆向工程任务，首先猜测提供的图像提示，然后生成自己的提示，旨在提高批判思维和提示技能。研究的变量包括提示花费的时间、字数、提示的相似性和具体性。定量分析了这些变量之间的相关性，并通过单向方差分析评估了各组之间的差异。虽然一些相关性显示出有意义的关系，但并非都具有统计学意义。方差分析结果显示，在字数、相似性和具体性方面存在统计学上的显著改善，特别是在AI人格和结构化提示指南的支持下。定性反馈证实了这些发现，学生们显示出增强的信心和批判思维能力。这些结果表明，定制GPT交互能显著提高学生清晰有效地传达建筑概念的能力。

Key Takeaways

研究聚焦在利用定制GPT模型提升建筑专业学生使用AI工具时的提示能力。
实验中，学生分为三组，分别接受不同水平的支持（无支持、结构化指南支持、结构化指南与AI人格支持）。
学生们参与反向工程任务，锻炼猜测和生成提示的能力，旨在提升批判思维和提示技能。
研究定量分析了学生提示花费的时间、字数、提示的相似性和具体性等变量。
方差分析显示，AI支持和结构化指南下的学生在字数、提示的相似性和具体性方面有显著改善。
定性反馈表明，学生们的自信和批判思维能力有所提升。

Cool Papers

点此查看论文截图

Causal-Copilot: An Autonomous Causal Analysis Agent

Authors:Xinyue Wang, Kun Zhou, Wenyi Wu, Har Simrat Singh, Fang Nan, Songyao Jin, Aryan Philip, Saloni Patnaik, Hou Zhu, Shivam Singh, Parjanya Prashant, Qian Shen, Biwei Huang

Causal analysis plays a foundational role in scientific discovery and reliable decision-making, yet it remains largely inaccessible to domain experts due to its conceptual and algorithmic complexity. This disconnect between causal methodology and practical usability presents a dual challenge: domain experts are unable to leverage recent advances in causal learning, while causal researchers lack broad, real-world deployment to test and refine their methods. To address this, we introduce Causal-Copilot, an autonomous agent that operationalizes expert-level causal analysis within a large language model framework. Causal-Copilot automates the full pipeline of causal analysis for both tabular and time-series data – including causal discovery, causal inference, algorithm selection, hyperparameter optimization, result interpretation, and generation of actionable insights. It supports interactive refinement through natural language, lowering the barrier for non-specialists while preserving methodological rigor. By integrating over 20 state-of-the-art causal analysis techniques, our system fosters a virtuous cycle – expanding access to advanced causal methods for domain experts while generating rich, real-world applications that inform and advance causal theory. Empirical evaluations demonstrate that Causal-Copilot achieves superior performance compared to existing baselines, offering a reliable, scalable, and extensible solution that bridges the gap between theoretical sophistication and real-world applicability in causal analysis. A live interactive demo of Causal-Copilot is available at https://causalcopilot.com/.

因果分析在科学发现和可靠决策制定中发挥着基础作用，然而由于其概念和算法的复杂性，领域专家很难获得这一知识。因果方法论与实际使用之间的脱节带来了双重挑战：领域专家无法利用因果学习方面的最新进展，而因果研究者缺乏广泛的现实世界部署来测试和完善他们的方法。为了解决这一问题，我们引入了因果Copilot，这是一个在大语言模型框架内操作专家级因果分析的自主体。因果Copilot自动完成因果分析的整个流程，无论是表格数据还是时间序列数据，包括因果发现、因果推理、算法选择、超参数优化、结果解释和生成可操作见解。它通过自然语言支持交互式改进，降低了非专业人士的门槛，同时保持了方法的严谨性。通过集成超过20项最先进的因果分析技术，我们的系统形成了一个良性循环——扩大领域专家对高级因果方法的访问权限，同时生成丰富的现实世界应用程序，为因果理论提供信息和促进其发展。实证研究证明，与现有基准相比，因果Copilot性能卓越，提供了一种可靠、可扩展和可扩展的解决方案，缩小了理论复杂性和现实世界应用之间的鸿沟。因果Copilot的实时交互式演示网站为：[https://causalcopilot.com/] 。

论文及项目相关链接

PDF

Summary

本文介绍了因果分析的重要性和挑战，包括其在科学发现和可靠决策制定中的基础作用，以及领域专家难以利用因果学习方法的问题。为解决这一问题，文章提出了一种名为Causal-Copilot的自主代理方案，可在大型语言模型框架内进行专家级别的因果分析。该方案可自动化因果分析的全流程，包括因果发现、因果推理、算法选择、超参数优化、结果解读和行动建议生成等。此外，它还支持通过自然语言进行交互式改进，降低了非专业人士的门槛，同时保持了方法论上的严谨性。通过整合超过20项最先进的因果分析技术，该系统形成了一个良性循环，扩大了领域专家对高级因果方法的访问权限，同时生成丰富的实际应用案例以推动因果理论的发展。实证评估表明，Causal-Copilot相较于现有基线方案表现出卓越性能，为因果分析领域搭建了理论先进性与实际应用之间的桥梁。

Key Takeaways

因果分析在科学发现和决策制定中具有重要作用，但领域专家难以利用最新的因果学习方法。
Causal-Copilot解决了这一问题，通过自主代理实现了专家级别的因果分析。
Causal-Copilot可以自动化因果分析的全流程，从因果发现到行动建议生成。
该系统支持自然语言交互，既简化了非专业人士的使用难度，又保持了专业严谨性。
系统整合了多种最先进的因果分析技术，促进了理论与实践的良性循环。
Causal-Copilot在实证评估中表现出卓越性能，超越了现有基线方案。

Cool Papers

点此查看论文截图

Forecasting from Clinical Textual Time Series: Adaptations of the Encoder and Decoder Language Model Families

Authors:Shahriar Noroozizadeh, Sayantan Kumar, Jeremy C. Weiss

Clinical case reports encode rich, temporal patient trajectories that are often underexploited by traditional machine learning methods relying on structured data. In this work, we introduce the forecasting problem from textual time series, where timestamped clinical findings – extracted via an LLM-assisted annotation pipeline – serve as the primary input for prediction. We systematically evaluate a diverse suite of models, including fine-tuned decoder-based large language models and encoder-based transformers, on tasks of event occurrence prediction, temporal ordering, and survival analysis. Our experiments reveal that encoder-based models consistently achieve higher F1 scores and superior temporal concordance for short- and long-horizon event forecasting, while fine-tuned masking approaches enhance ranking performance. In contrast, instruction-tuned decoder models demonstrate a relative advantage in survival analysis, especially in early prognosis settings. Our sensitivity analyses further demonstrate the importance of time ordering, which requires clinical time series construction, as compared to text ordering, the format of the text inputs that LLMs are classically trained on. This highlights the additional benefit that can be ascertained from time-ordered corpora, with implications for temporal tasks in the era of widespread LLM use.

临床病例报告包含了丰富的、随时间变化的患者轨迹信息，而传统的依赖结构化数据的机器学习方法往往未能充分利用这些信息。在这项研究中，我们从文本时间序列中引入了预测问题，其中通过大型语言模型辅助的标注管道提取的时间戳临床发现作为预测的主要输入。我们对一系列模型进行了系统评估，包括微调过的基于解码器的大型语言模型和基于编码器的转换器，用于事件发生的预测、时间顺序和生存分析任务。我们的实验表明，基于编码器的模型在短期和长期事件预测中始终获得更高的F1分数和时间一致性，而经过微调过的掩码方法提高了排名性能。相比之下，经过指令训练的解码器模型在生存分析中显示出相对优势，特别是在早期预后环境中。我们的敏感性分析进一步表明时间顺序的重要性，这需要临床时间序列的构建，与文本顺序相对，后者是大型语言模型经典训练中输入文本的格式。这突显了从时间顺序语料库中获得的额外好处，对广泛使用大型语言模型的时代的时序任务具有启示意义。

论文及项目相关链接

PDF Machine Learning for Healthcare (MLHC 2025)

Summary

本文介绍了临床病例报告中的时间序贯性数据预测问题。利用大型语言模型辅助的标注管道提取的时间戳临床发现作为主要输入进行预测。实验表明，编码器模型在事件发生的短期和长期预测中，F1分数较高且时间一致性更优；微调后的掩码方法能提高排序性能。相比之下，指令微调解码器模型在生存分析中表现出相对优势，特别是在早期预后设置中。同时强调了时间顺序的重要性，并指出传统的大型语言模型在文本输入格式上更侧重于文本顺序而非时间顺序的限制。因此，构建时间顺序语料库对于时代中时间任务尤为重要。

Key Takeaways

临床病例报告包含丰富的时序性患者轨迹信息，但传统机器学习方法往往未能充分利用。
利用大型语言模型辅助标注管道提取时间戳临床发现，为预测提供主要输入。
编码器模型在事件预测、时序排序和生存分析任务中表现优越。
微调后的掩码方法能提高排序性能。
指令微调解码器模型在生存分析中具有相对优势。
时间顺序对预测结果至关重要，需要构建临床时间序列。

Cool Papers

点此查看论文截图

STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

Authors:Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, Bo Zhao

The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving has become a prevailing trend. While MLLMs have been extensively studied for visual semantic understanding tasks, their ability to perform precise and quantitative spatial-temporal understanding in real-world applications remains largely unexamined, leading to uncertain prospects. To evaluate models’ Spatial-Temporal Intelligence, we introduce STI-Bench, a benchmark designed to evaluate MLLMs’ spatial-temporal understanding through challenging tasks such as estimating and predicting the appearance, pose, displacement, and motion of objects. Our benchmark encompasses a wide range of robot and vehicle operations across desktop, indoor, and outdoor scenarios. The extensive experiments reveals that the state-of-the-art MLLMs still struggle in real-world spatial-temporal understanding, especially in tasks requiring precise distance estimation and motion analysis.

使用多模态大型语言模型（MLLMs）作为嵌入式人工智能和自动驾驶的端到端解决方案已成为一种流行趋势。虽然MLLMs在视觉语义理解任务方面已被广泛研究，但它们在现实应用中进行精确和定量时空理解的能力尚未得到充分检验，因此存在不确定的前景。为了评估模型的时空智能，我们引入了STI-Bench，这是一个旨在通过估计和预测物体的外观、姿态、位移和运动等具有挑战性的任务来评估MLLMs的时空理解能力的基准测试。我们的基准测试涵盖了桌面、室内和室外场景中的广泛机器人和车辆操作。大量实验表明，最先进的MLLMs在真实世界的时空理解方面仍然存在困难，特别是在需要精确距离估计和运动分析的任务中。

论文及项目相关链接

PDF

总结

使用多模态大型语言模型（MLLMs）作为端到端的解决方案，为智能体和自动驾驶提供了流行趋势。虽然MLLMs在视觉语义理解任务方面已经得到了广泛的研究，但在现实世界应用中实现精确和定量时空理解的能力仍有待研究，导致前景不明朗。为了评估模型的时空智能，我们引入了STI-Bench基准测试平台，用于通过诸如估计和预测物体的外观、姿态、位移和运动等具有挑战性的任务来评估MLLMs的时空理解能力。该基准测试平台涵盖了桌面、室内和室外场景的广泛机器人和车辆操作范围。实验表明，最先进的MLLMs在真实世界的时空理解方面仍然存在困难，特别是在需要精确距离估计和运动分析的任务中。

要点掌握