LLM

发布日期: 2025-01-25

更新日期: 2025-02-12

文章字数: 18.9k

阅读时长: 77 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-01-25 更新

CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation

Authors:Guofeng Cui, Pichao Wang, Yang Liu, Zemian Ke, Zhu Liu, Vimal Bhat

Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging due to pretraining on English-centric data and the complexity of reinforcement learning from human feedback (RLHF). Direct Preference Optimization (DPO) has emerged as a simpler and more efficient alternative, but its performance depends heavily on the quality of preference data. To address this, we propose Confidence-Reward driven Preference Optimization (CRPO), a novel method that combines reward scores with model confidence to improve data selection for fine-tuning. CRPO selects challenging sentence pairs where the model is uncertain or underperforms, leading to more effective learning. While primarily designed for LLMs, CRPO also generalizes to encoder-decoder models like NLLB, demonstrating its versatility. Empirical results show that CRPO outperforms existing methods such as RS-DPO, RSO and MBR score in both translation accuracy and data efficiency.

大型语言模型（LLM）在自然语言处理任务中显示出巨大潜力，但将其应用于机器翻译（MT）仍然具有挑战性，这主要是由于以英语为中心的预训练数据和从人类反馈中进行强化学习（RLHF）的复杂性。直接偏好优化（DPO）作为一种更简单、更高效的选择方法已经出现，但其性能在很大程度上取决于偏好数据的质量。为了解决这一问题，我们提出了基于信心奖励的偏好优化（CRPO），这是一种结合奖励分数和模型信心以提高微调数据选择的新方法。CRPO选择模型不确定或表现不佳的具有挑战性的句子对，从而实现更有效的学习。虽然CRPO主要设计用于大型语言模型，但它也适用于编码器解码器模型，如NLLB，显示出其通用性。经验结果表明，CRPO在翻译准确性和数据效率方面都优于现有方法，如RS-DPO、RSO和MBR评分。

论文及项目相关链接

PDF

Summary

大型语言模型在自然语言处理任务中展现出巨大潜力，但在机器翻译方面的应用仍面临挑战。直接偏好优化（DPO）作为一种更简单高效的方法应运而生，但其性能取决于偏好数据的质量。为解决这一问题，我们提出信心奖励驱动偏好优化（CRPO）方法，结合奖励分数和模型信心改善数据选择，用于微调。CRPO选择模型不确定或表现不佳的难句对，实现更有效的学习。CRPO不仅适用于大型语言模型，也可推广至编码器解码器模型如NLLB，显示出其通用性。实证结果显示，CRPO在翻译准确性和数据效率方面优于现有方法如RS-DPO、RSO和MBR分数。

Key Takeaways

大型语言模型在机器翻译方面应用面临挑战，包括基于英语的数据预训练和强化学习的问题。
直接偏好优化（DPO）作为解决上述问题的简化方法被提出，但性能受限于偏好数据质量。
提出信心奖励驱动偏好优化（CRPO）方法，结合奖励分数和模型信心改善数据选择过程。
CRPO选择模型不确定或表现不佳的难句对进行更有效的学习。
CRPO不仅适用于大型语言模型，也可应用于编码器解码器模型如NLLB。
实证结果显示CRPO在翻译准确性和数据效率方面优于现有方法。

Cool Papers

点此查看论文截图

Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models

Authors:Linh Tran, Wei Sun, Stacy Patterson, Ana Milanova

Multimodal Large Language Models (LLMs) are pivotal in revolutionizing customer support and operations by integrating multiple modalities such as text, images, and audio. Federated Prompt Learning (FPL) is a recently proposed approach that combines pre-trained multimodal LLMs such as vision-language models with federated learning to create personalized, privacy-preserving AI systems. However, balancing the competing goals of personalization, generalization, and privacy remains a significant challenge. Over-personalization can lead to overfitting, reducing generalizability, while stringent privacy measures, such as differential privacy, can hinder both personalization and generalization. In this paper, we propose a Differentially Private Federated Prompt Learning (DP-FPL) approach to tackle this challenge by leveraging a low-rank adaptation scheme to capture generalization while maintaining a residual term that preserves expressiveness for personalization. To ensure privacy, we introduce a novel method where we apply local differential privacy to the two low-rank components of the local prompt, and global differential privacy to the global prompt. Our approach mitigates the impact of privacy noise on the model performance while balancing the tradeoff between personalization and generalization. Extensive experiments demonstrate the effectiveness of our approach over other benchmarks.

多模态大型语言模型（LLM）通过整合文本、图像和音频等多种模态，在革新客户支持和运营方面发挥着关键作用。联邦提示学习（FPL）是一种最近提出的方法，它将预训练的多模态LLM（如视觉语言模型）与联邦学习相结合，以创建个性化、保护隐私的人工智能系统。然而，在个性化、通用化和隐私之间取得平衡仍然是一个巨大的挑战。过度个性化可能导致过度拟合，降低通用性，而严格的隐私措施（如差分隐私）可能会阻碍个性化和通用化。在本文中，我们提出了一种差分私有联邦提示学习（DP-FPL）方法，通过利用低阶适应方案来捕捉通用性，同时保持一个残留项来保持个性化的表现力，以应对这一挑战。为确保隐私，我们介绍了一种新方法，对局部提示的两个低阶分量应用本地差分隐私，对全局提示应用全局差分隐私。我们的方法减轻了隐私噪声对模型性能的影响，在个性化与通用化之间取得了平衡。大量实验表明，我们的方法在其他基准测试上非常有效。

论文及项目相关链接

PDF Accepted to ICLR 2025 main conference track

Summary:
多媒体大语言模型通过集成文本、图像和音频等多媒体模式，在客户支持和运营领域掀起革命。本文提出一种基于差分隐私的联邦提示学习方法（DP-FPL），通过低阶适应方案实现个性化与泛化之间的平衡，同时采用本地差分隐私和全局差分隐私保护机制，有效减轻隐私噪声对模型性能的影响。

Key Takeaways:

多模态大语言模型在客户支持和运营中具有重要作用。
联邦提示学习是一种将预训练的多模态大语言模型与联邦学习相结合的方法，用于创建个性化、保护隐私的人工智能系统。
平衡个性化、泛化和隐私是面临的主要挑战。
过度的个性化可能导致过度拟合，降低泛化能力。
严格的隐私措施，如差分隐私，可能会影响个性化和泛化。
DP-FPL方法通过低阶适应方案平衡个性化和泛化，同时采用本地和全局差分隐私保护机制确保隐私。

Cool Papers

点此查看论文截图

Exploring Finetuned Audio-LLM on Heart Murmur Features

Authors:Adrian Florea, Xilin Jiang, Nima Mesgarani, Xiaofan Jiang

Large language models (LLMs) for audio have excelled in recognizing and analyzing human speech, music, and environmental sounds. However, their potential for understanding other types of sounds, particularly biomedical sounds, remains largely underexplored despite significant scientific interest. In this study, we focus on diagnosing cardiovascular diseases using phonocardiograms, i.e., heart sounds. Most existing deep neural network (DNN) paradigms are restricted to heart murmur classification (healthy vs unhealthy) and do not predict other acoustic features of the murmur such as timing, grading, harshness, pitch, and quality, which are important in helping physicians diagnose the underlying heart conditions. We propose to finetune an audio LLM, Qwen2-Audio, on the PhysioNet CirCor DigiScope phonocardiogram (PCG) dataset and evaluate its performance in classifying 11 expert-labeled murmur features. Additionally, we aim to achieve more noise-robust and generalizable system by exploring a preprocessing segmentation algorithm using an audio representation model, SSAMBA. Our results indicate that the LLM-based model outperforms state-of-the-art methods in 8 of the 11 features and performs comparably in the remaining 3. Moreover, the LLM successfully classifies long-tail murmur features with limited training data, a task that all previous methods have failed to classify. These findings underscore the potential of audio LLMs as assistants to human cardiologists in enhancing heart disease diagnosis.

大型语言模型（LLM）在音频领域已经能够出色地识别和解析人类语音、音乐和环境声音。然而，尽管科学界对此有浓厚的兴趣，它们对于理解其他类型的声音，尤其是生物医学声音，其潜力仍被大大低估。在这项研究中，我们专注于使用心音图（即心脏声音）来诊断心血管疾病。现有的大多数深度神经网络（DNN）模式仅限于心脏杂音分类（健康与否），无法预测杂音的其他声学特征，如时间、等级、严厉程度、音高和音质等，这些特征对于帮助医生诊断潜在的心脏状况非常重要。我们提议对音频LLM模型Qwen2-Audio进行微调，并在PhysioNet CirCor DigiScope心音图数据集上评估其性能，以分类专家标注的11种杂音特征。此外，我们还希望通过使用音频表示模型的预处理分割算法SSAMBA，建立一个更加稳健且通用的系统。我们的结果表明，基于LLM的模型在其中的八个特征中表现出优于最新方法的效果，并在其余三个特征中表现相当。此外，LLM能够成功地对长尾杂音特征进行分类，这些特征以前的方法都无法处理。这些发现突显了音频LLM作为人类心脏病学家的辅助工具在增强心脏病诊断方面的潜力。

论文及项目相关链接

PDF 5 pages, 1 figure, and 3 tables. Submitted to IEEE/ACM Conference on Connected Health: Applications, Systems , and Engineering Technologies

摘要

大型语言模型（LLM）在音频领域已展现出对人脸、音乐和自然环境的优异识别与分析能力。尽管生物医学领域存在对该技术探索的兴趣，但该模型理解其它类型声音如医疗领域生物医学声音的潜力仍未得到广泛研究。本研究关注心脏病诊断的语音图谱领域。多数现有神经网络框架局限于心脏杂音的分类（健康与不健康），且未对其他声学特性如时序性、分级度、猛烈程度等进行预测分析。而这些因素对医生的准确诊断起到重要作用。本研究的首要目的是针对一个名为Qwen2的大型音频语言模型进行优化处理并使其能够符合语音学图像的要求。随后通过一项包含心血管疾病心脏杂音专业标签特征的实验对其进行测试。我们的实验设计目标是创建更为稳健和普遍适用的系统模型，在此过程中采用一种基于音频表现模型的预处理分割算法——SSAMBA。研究结果显示，基于大型语言模型的模型在其中的八项特征上优于当前主流方法，并在其余三项特征上表现相当。更重要的是，大型语言模型在样本数据不足的情况下成功预测了杂音的特征。这表明大型语言模型具有作为心脏病诊断的辅助工具进行应用的潜力。

关键见解

大型语言模型在音频领域具有广泛的应用潜力，特别是在识别和分析人类语音、音乐和环境声音方面表现出色。
在心脏病诊断中，LLM模型在分类心脏杂音特征方面展现出显著优势，能够预测多种声学特征，如时序性、分级度等。
Qwen2等大型音频语言模型的训练数据集需特定化处理以匹配语音图像集要求以提升模型的适用性。
相比传统的神经网络框架，大型语言模型可以在缺乏足够的样本数据时仍然展现出稳定的预测能力。这意味着即使在医疗资源不足的情境中，该模型仍然有应用价值。
利用音频表现模型如SSAMBA的预处理分割算法可进一步提升系统的噪声稳健性和普遍性。这一做法可以增强模型处理实际环境中可能出现的不确定性因素的能力。
本研究中提出的大型语言模型可作为一个有力的工具协助心脏病医生进行准确诊断，并有助于改进心脏病诊断和治疗流程的质量。这可能使得医生的诊断更为高效，同时为患者在治疗中节省了时间和资源。

Cool Papers

点此查看论文截图

Predicting Compact Phrasal Rewrites with Large Language Models for ASR Post Editing

Authors:Hao Zhang, Felix Stahlberg, Shankar Kumar

Large Language Models (LLMs) excel at rewriting tasks such as text style transfer and grammatical error correction. While there is considerable overlap between the inputs and outputs in these tasks, the decoding cost still increases with output length, regardless of the amount of overlap. By leveraging the overlap between the input and the output, Kaneko and Okazaki (2023) proposed model-agnostic edit span representations to compress the rewrites to save computation. They reported an output length reduction rate of nearly 80% with minimal accuracy impact in four rewriting tasks. In this paper, we propose alternative edit phrase representations inspired by phrase-based statistical machine translation. We systematically compare our phrasal representations with their span representations. We apply the LLM rewriting model to the task of Automatic Speech Recognition (ASR) post editing and show that our target-phrase-only edit representation has the best efficiency-accuracy trade-off. On the LibriSpeech test set, our method closes 50-60% of the WER gap between the edit span model and the full rewrite model while losing only 10-20% of the length reduction rate of the edit span model.

大型语言模型（LLM）擅长文本重写任务，如文本风格转换和语法错误修正。虽然这些任务的输入和输出之间存在大量重叠，但解码成本仍会随着输出的增长而增加，无论重叠程度如何。Kaneko和Okazaki（2023年）利用输入和输出之间的重叠，提出了与模型无关的编辑跨度表示法，以压缩重写内容，从而节省计算。他们报告称，在四项重写任务中，输出长度缩减率接近80%，且对精度的影响微乎其微。在本文中，我们受到基于短语的统计机器翻译的启发，提出了另一种编辑短语表示法。我们系统地比较了我们的短语表示法与他们的跨度表示法。我们将LLM重写模型应用于自动语音识别（ASR）的后编辑任务，并证明我们的目标短语仅编辑表示法在效率与准确性方面达到了最佳平衡。在LibriSpeech测试集上，我们的方法缩小了编辑跨度模型和全重写模型之间的词错误率（WER）差距的50-60%，同时只失去了编辑跨度模型的长度缩减率的10-20%。

论文及项目相关链接

PDF accepted by ICASSP 2025

Summary

大型语言模型（LLM）擅长文本改写任务，如文本风格转换和语法错误修正。尽管输入和输出之间存在大量重叠，解码成本仍随输出长度的增加而增加。Kaneko和Okazaki（2023）提出模型无关的编辑跨度表示法，利用输入和输出之间的重叠来压缩重写内容，以节省计算。他们报告称，在四项重写任务中，输出长度缩减率接近80%，且准确性影响极小。本文受基于短语统计机器翻译的启发，提出替代的编辑短语表示法。我们系统地比较了基于短语的表示法与基于跨度的表示法。我们将LLM重写模型应用于自动语音识别（ASR）的后编辑任务，并发现目标短语仅编辑表示法具有最佳的效率-准确性权衡。在LibriSpeech测试集上，我们的方法缩小了编辑跨度模型和全重写模型之间约50-60%的词错误率差距，同时仅损失了编辑跨度模型的10-20%长度缩减率。

Key Takeaways

LLM在文本改写任务上表现出色，如文本风格转换和语法错误修正。
解码成本随输出长度增加而增加，即使输入和输出存在大量重叠。
Kaneko和Okazaki（2023）提出了模型无关的编辑跨度表示法，利用输入和输出的重叠来压缩重写内容。
输出长度缩减率接近80%，同时准确性影响较小。
本文受短语统计机器翻译的启发，提出了编辑短语表示法。
在ASR后编辑任务中，目标短语仅编辑表示法具有最佳效率-准确性权衡。

Cool Papers

点此查看论文截图

Enhancing LLMs for Governance with Human Oversight: Evaluating and Aligning LLMs on Expert Classification of Climate Misinformation for Detecting False or Misleading Claims about Climate Change

Authors:Mowafak Allaham, Ayse D. Lokmanoglu, Sol P. Hart, Erik C. Nisbet

Climate misinformation is a problem that has the potential to be substantially aggravated by the development of Large Language Models (LLMs). In this study we evaluate the potential for LLMs to be part of the solution for mitigating online dis/misinformation rather than the problem. Employing a public expert annotated dataset and a curated sample of social media content we evaluate the performance of proprietary vs. open source LLMs on climate misinformation classification task, comparing them to existing climate-focused computer-assisted tools and expert assessments. Results show (1) state-of-the-art (SOTA) open-source models substantially under-perform in classifying climate misinformation compared to proprietary models, (2) existing climate-focused computer-assisted tools leveraging expert-annotated datasets continues to outperform many of proprietary models, including GPT-4o, and (3) demonstrate the efficacy and generalizability of fine-tuning GPT-3.5-turbo on expert annotated dataset in classifying claims about climate change at the equivalency of climate change experts with over 20 years of experience in climate communication. These findings highlight 1) the importance of incorporating human-oversight, such as incorporating expert-annotated datasets in training LLMs, for governance tasks that require subject-matter expertise like classifying climate misinformation, and 2) the potential for LLMs in facilitating civil society organizations to engage in various governance tasks such as classifying false or misleading claims in domains beyond climate change such as politics and health science.

关于气候误信息的处理是一个可能会因大型语言模型（LLM）的发展而进一步加剧的问题。在这项研究中，我们评估了LLM作为解决方案的一部分，以减轻在线错误信息的程度，而非加剧问题。我们采用公共专家注释的数据集和社交媒体内容的精选样本，评估专有与开源LLM在气候错误信息分类任务上的表现，并将其与现有的专注于气候的计算机辅助工具和专家评估进行比较。结果表明，（1）最先进的开源模型在气候误信息的分类方面显著落后于专有模型；（2）现有的以气候为重点的计算机辅助工具，利用专家注释的数据集继续超越许多专有模型，包括GPT-4o；（3）证明了在专家注释的数据集上微调GPT-3.5 turbo在气候变化主张分类方面的有效性和可推广性，相当于拥有超过二十年气候传播经验的专家水平。这些发现强调了（1）融入人类监督的重要性，例如在训练LLM时融入专家注释的数据集，用于需要专业知识的治理任务如气候信息错误分类；（2）LLM的潜力在于促进公民社会组织参与各种治理任务，如政治和生命科学等领域虚假或误导性主张的分类。

论文及项目相关链接

PDF Accepted to the AI Governance Workshop at AAAI 2025

Summary：
大型语言模型（LLM）的发展有可能加剧气候谣言问题。本研究评估LLM在缓解在线错误信息方面的作用，而不是加剧问题。通过公共专家注释数据集和社交媒体内容的样本，我们比较了专有与开源LLM在气候谣言分类任务上的性能，以及与现有的气候相关计算机辅助工具和专家评估相比。研究结果表明，先进的开源模型在分类气候谣言方面显著落后于专有模型；现有的气候相关计算机辅助工具仍然优于许多专有模型，包括GPT-4o；精细调整GPT-3.5-turbo在专家注释数据集上可有效且普遍地分类气候变化主张。这些发现强调了融入人类监督（如专家注释数据集）以训练LLM的重要性，特别是在需要专业知识的任务中如分类气候谣言；以及LLM在帮助民间社会组织参与各种治理任务（如分类气候以外领域的虚假或误导性主张）的潜力。

Key Takeaways：

LLM的发展有可能加剧气候谣言问题。
本研究评估LLM在缓解在线错误信息方面的作用。
专有LLM模型在气候谣言分类任务上性能优于开源模型。
现有的气候相关计算机辅助工具表现优于许多专有LLM模型。
精细调整GPT-3.5-turbo在专家注释数据集上可有效分类气候变化主张。
融入人类监督（如专家注释数据集）对于训练LLM至关重要，特别是在需要专业知识的任务中。

Cool Papers

点此查看论文截图

Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak

Authors:Erjia Xiao, Hao Cheng, Jing Shao, Jinhao Duan, Kaidi Xu, Le Yang, Jindong Gu, Renjing Xu

Large Language Models (LLMs) demonstrate remarkable zero-shot performance across various natural language processing tasks. The integration of multimodal encoders extends their capabilities, enabling the development of Multimodal Large Language Models that process vision, audio, and text. However, these capabilities also raise significant security concerns, as these models can be manipulated to generate harmful or inappropriate content through jailbreak. While extensive research explores the impact of modality-specific input edits on text-based LLMs and Large Vision-Language Models in jailbreak, the effects of audio-specific edits on Large Audio-Language Models (LALMs) remain underexplored. Hence, this paper addresses this gap by investigating how audio-specific edits influence LALMs inference regarding jailbreak. We introduce the Audio Editing Toolbox (AET), which enables audio-modality edits such as tone adjustment, word emphasis, and noise injection, and the Edited Audio Datasets (EADs), a comprehensive audio jailbreak benchmark. We also conduct extensive evaluations of state-of-the-art LALMs to assess their robustness under different audio edits. This work lays the groundwork for future explorations on audio-modality interactions in LALMs security.

大型语言模型（LLM）在各种自然语言处理任务中表现出显著的零样本性能。多模态编码器的集成扩展了它们的功能，推动了能够处理视觉、音频和文本的多模态大型语言模型的发展。然而，这些功能也引发了严重的安全担忧，因为这些模型可以通过越狱生成有害或不当内容而受到操纵。虽然大量研究探讨了模态特定输入编辑对基于文本的大型语言模型和大型视觉语言模型的影响，但针对特定音频编辑对大型音频语言模型（LALM）的影响研究仍然不足。因此，本文通过研究特定音频编辑如何影响LALM关于越狱的推理来填补这一空白。我们引入了音频编辑工具箱（AET），它支持音频模态编辑，如音调调整、单词强调和噪声注入，以及编辑后的音频数据集（EAD），这是一个全面的音频越狱基准测试。我们还对最先进的LALM进行了广泛评估，以测试它们在不同的音频编辑下的稳健性。这项工作为未来在LALM安全性方面探索音频模态交互奠定了基础。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）展现出跨多种自然语言处理任务的出色零样本性能。通过融入多模态编码器，进一步开发出能够处理视觉、音频和文本的多模态大型语言模型。然而，这些能力也引发了严重的安全担忧，因为模型可能会被操纵以生成有害或不当内容。尽管已有大量研究探讨了模态特定输入编辑对文本型LLM和大型视觉语言模型的影响，但针对音频特定编辑对大型音频语言模型（LALM）的影响的研究仍显不足。本文旨在填补这一空白，研究音频特定编辑如何影响LALM在越狱场景中的推断。引入了音频编辑工具箱（AET）和编辑音频数据集（EADs），并对当前先进的LALM进行了广泛评估，以测试其在不同音频编辑下的稳健性。

Key Takeaways

大型语言模型展现出跨多种自然语言处理任务的出色零样本性能。
多模态编码器的融入扩展了大型语言模型的能力，包括处理视觉、音频和文本。
大型语言模型存在被操纵生成有害或不当内容的安全风险。
对音频特定编辑影响大型音频语言模型（LALM）的研究仍然不足。
引入音频编辑工具箱（AET）和编辑音频数据集（EADs）为研究提供工具和数据支持。
广泛评估显示，当前先进的LALM在不同音频编辑下的稳健性有待提升。

Cool Papers

点此查看论文截图

UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models

Authors:Xin Xu, Jiaxin Zhang, Tianhao Chen, Zitong Chao, Jishan Hu, Can Yang

Large Language Models (LLMs) have made significant strides in mathematical reasoning, underscoring the need for a comprehensive and fair evaluation of their capabilities. However, existing benchmarks often fall short, either lacking extensive coverage of undergraduate-level mathematical problems or probably suffering from test-set contamination. To address these issues, we introduce UGMathBench, a diverse and dynamic benchmark specifically designed for evaluating undergraduate-level mathematical reasoning with LLMs. UGMathBench comprises 5,062 problems across 16 subjects and 111 topics, featuring 10 distinct answer types. Each problem includes three randomized versions, with additional versions planned for release as leading open-source LLMs become saturated in UGMathBench. Furthermore, we propose two key metrics: effective accuracy (EAcc), which measures the percentage of correctly solved problems across all three versions, and reasoning gap ($\Delta$), which assesses reasoning robustness by calculating the difference between the average accuracy across all versions and EAcc. Our extensive evaluation of 23 leading LLMs reveals that the highest EAcc achieved is 56.3% by OpenAI-o1-mini, with large $\Delta$ values observed across different models. This highlights the need for future research aimed at developing “large reasoning models” with high EAcc and $\Delta = 0$. We anticipate that the release of UGMathBench, along with its detailed evaluation codes, will serve as a valuable resource to advance the development of LLMs in solving mathematical problems.

大型语言模型（LLM）在数学推理方面取得了显著进展，这强调了对它们的能力进行全面公平评估的必要性。然而，现有的基准测试通常存在不足，要么没有广泛覆盖本科数学问题和可能存在的测试集污染。为了解决这些问题，我们推出了UGMathBench，这是一个专门设计用于评估本科数学推理的大型语言模型（LLM）的多元动态基准测试。UGMathBench包含5,062个跨越16个学科和111个主题的问题，包含10种不同的答案类型。每个问题包括三个随机版本，随着领先的开源大型语言模型在UGMathBench中趋于饱和，我们还计划发布更多版本。此外，我们提出了两个关键指标：有效准确率（EAcc），用于衡量所有三个版本中正确解决的问题的百分比；推理差距（Δ），通过计算所有版本平均准确率和EAcc之间的差异来评估推理稳健性。我们对23个领先的大型语言模型的广泛评估表明，OpenAI-o1-mini的最高有效准确率达到了56.3%，不同模型之间观察到较大的Δ值。这强调了未来研究需要致力于开发具有高有效准确率和Δ=0的“大型推理模型”。我们预期，UGMathBench及其详细的评估代码的发布将作为推动大型语言模型解决数学问题的重要资源。

论文及项目相关链接

PDF Accepted to ICLR 2025

Summary

LLMs在数学推理方面取得显著进展，但现有评估标准存在不足。为此，提出UGMathBench评估平台，涵盖5062个问题和多种题型，旨在全面公平地评估LLM解决本科数学问题的能力。引入有效准确度（EAcc）和推理差距（Δ）两个关键指标，对现有23个LLM模型进行评估，发现存在提升空间。期待UGMathBench及其评估代码能促进LLM在数学问题解答方面的进一步发展。

Key Takeaways

LLMs在数学推理能力上取得显著进步，需要全面公平的评估标准。
UGMathBench是一个专为评估LLM解决本科数学问题能力的平台，包含多样化和动态的问题库。
引入有效准确度（EAcc）和推理差距（Δ）两个关键指标来评估LLM的推理能力。
UGMathBench对现有23个LLM模型进行了评估，发现最高有效准确度为OpenAI-o1-mini的56.3%，存在提升空间。
需要进一步研究和开发具有高水平有效准确度和零推理差距的“大型推理模型”。

Cool Papers

点此查看论文截图

Authors:Yang Bai, Christan Earl Grant, Daisy Zhe Wang

Multi-modal retrieval-augmented Question Answering (MRAQA), integrating text and images, has gained significant attention in information retrieval (IR) and natural language processing (NLP). Traditional ranking methods rely on small encoder-based language models, which are incompatible with modern decoder-based generative large language models (LLMs) that have advanced various NLP tasks. To bridge this gap, we propose RAMQA, a unified framework combining learning-to-rank methods with generative permutation-enhanced ranking techniques. We first train a pointwise multi-modal ranker using LLaVA as the backbone. Then, we apply instruction tuning to train a LLaMA model for re-ranking the top-k documents using an innovative autoregressive multi-task learning approach. Our generative ranking model generates re-ranked document IDs and specific answers from document candidates in various permutations. Experiments on two MRAQA benchmarks, WebQA and MultiModalQA, show significant improvements over strong baselines, highlighting the effectiveness of our approach. Code and data are available at: https://github.com/TonyBY/RAMQA

多模态检索增强问答（MRAQA）融合了文本和图像，在信息检索（IR）和自然语言处理（NLP）领域引起了广泛关注。传统排名方法依赖于基于编码器的小型语言模型，这些模型与现代基于解码器的生成式大型语言模型（LLM）不兼容，而LLM已在各种NLP任务中取得了进展。为了弥补这一差距，我们提出了RAMQA，这是一个结合了学习排名方法和生成置换增强排名技术的统一框架。我们首先使用LLaVA作为骨干训练点式多模态排名器。然后，我们应用指令调整来训练一个LLaMA模型，使用创新的自回归多任务学习方法对前k个文档进行重新排名。我们的生成排名模型从文档候选者中生成重新排名的文档ID和特定答案，这些文档以各种排列方式呈现。在WebQA和多模态QA两个MRAQA基准测试上的实验表明，与强大的基线相比，我们的方法有明显的改进，突出了我们的方法的有效性。相关代码和数据可在https://github.com/TonyBY/RAMQA找到。

论文及项目相关链接

PDF Accepted by NAACL 2025 Findings

Summary

多模态检索增强问答（MRAQA）融合了文本和图像，在信息检索（IR）和自然语言处理（NLP）领域引起了广泛关注。传统排序方法依赖于小型编码器基础的语言模型，与现代解码器基础的大型语言模型（LLM）不兼容。本文提出RAMQA框架，结合学习排序方法与生成排列增强排序技术。通过以LLaVA为骨干进行点对多模态排序器训练，再应用指令微调训练LLaMA模型，以创新自回归多任务学习方式进行前k个文档重新排序。生成式排名模型能够生成重新排序的文档ID和来自文档候选者的特定答案，在各种排列中表现优异。在WebQA和MultiModalQA两个MRAQA基准测试上，相较于强基线有显著改进，凸显了该方法的有效性。

Key Takeaways

MRAQA融合了文本和图像，在信息检索和自然语言处理中受到重视。
传统排序方法依赖小型编码器语言模型，与大型语言模型不兼容。
RAMQA框架结合了学习排序和生成排列增强排序技术。
使用LLaVA作为骨干进行点对多模态排序器训练。
通过指令微调训练LLaMA模型进行文档重新排序。
生成式排名模型能生成重新排序的文档ID和特定答案。
在MRAQA基准测试上，RAMQA相比基线有显著改善。

Cool Papers

点此查看论文截图

ARTEMIS-DA: An Advanced Reasoning and Transformation Engine for Multi-Step Insight Synthesis in Data Analytics

Authors:Atin Sakkeer Hussain

This paper presents the Advanced Reasoning and Transformation Engine for Multi-Step Insight Synthesis in Data Analytics (ARTEMIS-DA), a novel framework designed to augment Large Language Models (LLMs) for solving complex, multi-step data analytics tasks. ARTEMIS-DA integrates three core components: the Planner, which dissects complex user queries into structured, sequential instructions encompassing data preprocessing, transformation, predictive modeling, and visualization; the Coder, which dynamically generates and executes Python code to implement these instructions; and the Grapher, which interprets generated visualizations to derive actionable insights. By orchestrating the collaboration between these components, ARTEMIS-DA effectively manages sophisticated analytical workflows involving advanced reasoning, multi-step transformations, and synthesis across diverse data modalities. The framework achieves state-of-the-art (SOTA) performance on benchmarks such as WikiTableQuestions and TabFact, demonstrating its ability to tackle intricate analytical tasks with precision and adaptability. By combining the reasoning capabilities of LLMs with automated code generation and execution and visual analysis, ARTEMIS-DA offers a robust, scalable solution for multi-step insight synthesis, addressing a wide range of challenges in data analytics.

本文介绍了用于数据解析中的多步骤见解合成的先进推理与转换引擎（ARTEMIS-DA），这是一种旨在增强大型语言模型（LLM）以解决复杂的多步骤数据分析任务的新型框架。ARTEMIS-DA集成了三个核心组件：Planner，它将复杂的用户查询划分为结构化、顺序指令，包括数据预处理、转换、预测建模和可视化；Coder，它动态生成并执行Python代码以执行这些指令；以及Grapher，它解释生成的可视化以获取可操作见解。通过协调这些组件之间的协作，ARTEMIS-DA有效地管理涉及高级推理、多步骤转换和跨不同数据模态的合成的高级分析工作流程。该框架在WikiTableQuestions和TabFact等基准测试上实现了最先进的性能，证明了其精确度和适应性处理复杂分析任务的能力。通过将LLM的推理能力与自动化代码生成和执行以及可视化分析相结合，ARTEMIS-DA为多步骤见解合成提供了稳健且可扩展的解决方案，解决了数据分析中的广泛挑战。

论文及项目相关链接

PDF

Summary

ARTEMIS-DA框架是一种先进的用于多步骤洞察合成数据分析的高级推理与转换引擎。它通过整合规划器、编码器和绘图器三个核心组件，实现复杂的多步骤数据分析任务。ARTEMIS-DA有效管理高级推理、多步骤转换和跨不同数据模态的合成等复杂分析工作流程，并在WikiTableQuestions和TabFact等基准测试中实现最先进的性能表现。它将大型语言模型的推理能力与自动化代码生成和执行以及可视化分析相结合，为复杂的数据分析任务提供了稳健、可扩展的解决方案。

Key Takeaways

ARTEMIS-DA是一个旨在增强大型语言模型（LLMs）解决复杂多步骤数据分析任务的新型框架。
它包含三个核心组件：规划器负责将复杂用户查询分解为结构化顺序指令，编码器动态生成并执行Python代码实现这些指令，而绘图器则解释生成的可视化以获取可操作见解。
ARTEMIS-DA通过协同这些组件，有效管理涉及高级推理、多步骤转换和跨不同数据模态的合成等复杂分析工作流程。
ARTEMIS-DA在WikiTableQuestions和TabFact等基准测试中实现了最先进的性能。
该框架将LLMs的推理能力与自动化代码生成和执行以及可视化分析相结合，为复杂的数据分析任务提供了全面解决方案。

Cool Papers

点此查看论文截图

Interactive Cycle Model: The Linkage Combination among Automatic Speech Recognition, Large Language Models and Smart Glasses

Authors:Libo Wang

This research proposes the interaction loop model “ASR-LLMs-Smart Glasses”, which model combines automatic speech recognition, large language model and smart glasses to facilitate seamless human-computer interaction. And the methodology of this research involves decomposing the interaction process into different stages and elements. Speech is captured and processed by ASR, then analyzed and interpreted by LLMs. The results are then transmitted to smart glasses for display. The feedback loop is complete when the user interacts with the displayed data. Mathematical formulas are used to quantify the performance of the model that revolves around core evaluation points: accuracy, coherence, and latency during ASR speech-to-text conversion. The research results are provided theoretically to test and evaluate the feasibility and performance of the model. Detailed architectural details and experimental process have been uploaded to Github, the link is:https://github.com/brucewang123456789/GeniusTrail.git.

本研究提出了“ASR-LLMs-智能眼镜”交互循环模型，该模型结合了自动语音识别、大型语言模型和智能眼镜，促进了无缝的人机交互。本研究的方法论涉及将交互过程分解成不同的阶段和元素。语音被ASR捕获并处理，然后被LLMs分析和解释。结果随后传输到智能眼镜进行显示。当用户与显示的数据交互时，反馈循环完成。本研究使用数学公式来量化模型性能，围绕核心评估点：ASR语音转文本转换过程中的准确性、连贯性和延迟。研究结果在理论上提供了测试和评估该模型可行性和性能的依据。详细的架构细节和实验过程已经上传到Github，链接为：https://github.com/brucewang123456789/GeniusTrail.git。

论文及项目相关链接

PDF OpenReview submitted. 10 pages of text and 2 figures

Summary

该研究提出了“ASR-LLMs-智能眼镜”交互循环模型，该模型结合了自动语音识别、大型语言模型和智能眼镜，促进了无缝的人机交互。研究方法将交互过程分解成不同的阶段和元素。语音由ASR捕获并处理，然后通过LLMs进行分析和解释。结果传输到智能眼镜进行显示。用户与显示的数据进行交互时，反馈循环完成。围绕核心评估点（准确性、连贯性和ASR语音转文本的延迟），使用数学公式对模型性能进行量化。研究成果已在理论上进行了测试并评估了模型的可行性及性能。详细架构和实验过程已上传至Github。有关链接：https://github.com/brucewang123456789/GeniusTrail.git。

Key Takeaways

研究提出了ASR-LLMs-智能眼镜交互循环模型，整合自动语音识别、大型语言模型和智能眼镜技术。
模型实现了无缝人机互动，通过分解交互过程为不同阶段和元素来优化交互体验。
语音先由ASR处理，然后通过LLMs进行分析和解释，最后将结果展示在智能眼镜上。
模型性能通过数学公式量化，核心评估标准包括准确性、连贯性和延迟。
该模型已经在理论上得到了测试，验证了其可行性和性能表现。
详细的模型架构和实验过程已经公开在Github上供公众查阅。

Cool Papers

点此查看论文截图

Reducing Reasoning Costs: The Path of Optimization for Chain of Thought via Sparse Attention Mechanism

Authors:Libo Wang

In order to address the chain of thought in the large language model inference cost surge, this research proposes to use a sparse attention mechanism that only focuses on a few relevant tokens. The researcher constructed a new attention mechanism and used GiantRabbit trained with custom GPTs as an experimental tool. The experiment tested and compared the reasoning time, correctness score and chain of thought length of this model and o1 Preview in solving the linear algebra test questions of MIT OpenCourseWare. The results show that GiantRabbit’s reasoning time and chain of thought length are significantly lower than o1 Preview. It verifies the feasibility of sparse attention mechanism for optimizing chain of thought reasoning. Detailed architectural details and experimental process have been uploaded to Github, the link is:https://github.com/brucewang123456789/GeniusTrail.git.

为了应对大型语言模型推理成本飙升中的思维链问题，本研究提出了一种稀疏注意力机制，该机制只关注少数相关令牌。研究者构建了一种新的注意力机制，并使用GiantRabbit作为实验工具，GiantRabbit通过自定义GPT进行了训练。实验测试并比较了该模型与o1 Preview在解决MIT OpenCourseWare的线性代数测试问题时的推理时间、正确率和思维链长度。结果表明，GiantRabbit的推理时间和思维链长度明显低于o1 Preview。这验证了稀疏注意力机制在优化思维链推理方面的可行性。详细的架构细节和实验过程已上传到Github，链接为：https://github.com/brucewang123456789/GeniusTrail.git。

论文及项目相关链接

PDF The main text is 5 pages, totaling 9 pages; 4 figures, 1 table. It have been submitted to NeurIPS 2024 Workshop MusIML and OpenReview

Summary
大型语言模型推理成本激增的问题得到了解决。研究提出了稀疏注意力机制来关注关键内容。通过实验发现，新型注意力机制的推理时间和推理思路长度相较于预览都明显更优，且可行。更具体的方案和实验结果在GitHub中给出。详情请参见GitHub链接（https://github.com/brucewang123456789/GeniusTrail.git）。简洁地表达出来的文章主要内容如下：Sparse Attention机制提高了大语言模型的推理效率；GiantRabbit模型表现优异；GitHub链接提供了详细信息。

Key Takeaways

以下是基于文本提取的七个关键要点：

研究者提出使用稀疏注意力机制来解决大型语言模型推理成本上升的问题。
研究中设计了一种新型注意力机制并通过GiantRabbit模型进行实验验证。
GiantRabbit模型是用定制的GPT训练的，作为实验工具进行推理性能测试。
实验通过对比GiantRabbit模型和o1 Preview在解决MIT OpenCourseWare的线性代数测试问题上的表现来评估模型性能。
实验结果显示GiantRabbit模型的推理时间和推理思路长度均显著低于o1 Preview。
实验验证了稀疏注意力机制在优化推理思路方面的可行性。

Cool Papers

点此查看论文截图

DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation

Authors:Jiatao Gu, Yuyang Wang, Yizhe Zhang, Qihang Zhang, Dinghuai Zhang, Navdeep Jaitly, Josh Susskind, Shuangfei Zhai

Diffusion models have become the dominant approach for visual generation. They are trained by denoising a Markovian process which gradually adds noise to the input. We argue that the Markovian property limits the model’s ability to fully utilize the generation trajectory, leading to inefficiencies during training and inference. In this paper, we propose DART, a transformer-based model that unifies autoregressive (AR) and diffusion within a non-Markovian framework. DART iteratively denoises image patches spatially and spectrally using an AR model that has the same architecture as standard language models. DART does not rely on image quantization, which enables more effective image modeling while maintaining flexibility. Furthermore, DART seamlessly trains with both text and image data in a unified model. Our approach demonstrates competitive performance on class-conditioned and text-to-image generation tasks, offering a scalable, efficient alternative to traditional diffusion models. Through this unified framework, DART sets a new benchmark for scalable, high-quality image synthesis.

扩散模型已经成为视觉生成的主要方法。它们通过去噪马尔可夫过程进行训练，该过程逐步向输入添加噪声。我们认为，马尔可夫属性限制了模型充分利用生成轨迹的能力，从而在训练和推理过程中造成效率低下。在本文中，我们提出了DART，这是一个基于变压器的模型，它在非马尔可夫框架内统一了自回归（AR）和扩散。DART使用与标准语言模型相同的架构的AR模型，在空间上光谱上迭代去噪图像块。DART不依赖于图像量化，能够在保持灵活性的同时实现更有效的图像建模。此外，DART可以在统一模型中无缝地训练文本和图像数据。我们的方法在类条件生成和文本到图像生成任务上表现出有竞争力的性能，为传统扩散模型提供了可扩展且高效的替代方案。通过这一统一框架，DART为可扩展的高质量图像合成设定了新的基准。

论文及项目相关链接

PDF Accepted by ICLR2025

Summary

本文提出一种基于非马尔可夫框架的统一扩散模型DART，结合了自回归（AR）和扩散模型。DART通过空间上和光谱上的图像块迭代去噪，使用与标准语言模型相同的架构的自回归模型。它不需要依赖图像量化，能在保持灵活性的同时更有效地进行图像建模。此外，DART可以在统一模型中无缝地结合文本和图像数据进行训练。该方法在类条件生成和文本到图像生成任务上表现出竞争力，成为可扩展、高效的传统扩散模型的替代品，为可伸缩高质量图像合成设定了新的基准。

Key Takeaways

扩散模型已成为视觉生成的主导方法，但其Markovian属性限制了模型在生成轨迹上的完全利用能力，导致训练和推断效率低下。
DART是一个基于非马尔可夫框架的统一模型，结合了自回归（AR）和扩散模型。
DART通过空间上和光谱上的图像块迭代去噪，采用与标准语言模型相同的架构的自回归模型。
DART不需要依赖图像量化，可以更有效地进行图像建模并保持灵活性。
DART能在统一模型中无缝结合文本和图像数据进行训练。
DART在类条件生成和文本到图像生成任务上具有竞争力。

Cool Papers

点此查看论文截图

NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls

Authors:Kinjal Basu, Ibrahim Abdelaziz, Kiran Kate, Mayank Agarwal, Maxwell Crouse, Yara Rizk, Kelsey Bradford, Asim Munawar, Sadhana Kumaravel, Saurabh Goyal, Xin Wang, Luis A. Lastras, Pavan Kapanipathi

The resurgence of autonomous agents built using large language models (LLMs) to solve complex real-world tasks has brought increased focus on LLMs’ fundamental ability of tool or function calling. At the core of these agents, an LLM must plan, execute, and respond using external tools, APIs, and custom functions. Research on tool calling has gathered momentum, but evaluation benchmarks and datasets representing the complexity of the tasks have lagged behind. In this work, we focus on one such complexity, nested sequencing, with the goal of extending existing benchmarks and evaluation. Specifically, we present NESTFUL, a benchmark to evaluate LLMs on nested sequences of API calls, i.e., sequences where the output of one API call is passed as input to a subsequent call. NESTFUL contains 1800+ nested sequences where all the function calls are executable. Experimental results on multiple models and settings show that the best-performing model on the dataset has a full sequence match accuracy of 25% and win-rate of 34% necessitating a large scope for improvement in the nested sequencing aspect of function calling. Our analysis of these results provides possible future research directions for the community, in addition to a benchmark to track progress. We have released the NESTFUL dataset under the Apache 2.0 license at https://github.com/IBM/NESTFUL.

使用大型语言模型（LLM）构建的自主代理人的复苏，用于解决复杂的现实世界任务，这增加了对LLM基本工具或功能调用能力的关注。这些代理人的核心，LLM必须利用外部工具、API和自定义函数进行规划、执行和响应。关于工具调用的研究已经蓄势待发，但是代表任务复杂度的评估基准和数据集却滞后了。在这项工作中，我们专注于其中一种复杂性，即嵌套序列，旨在扩展现有的基准评估和数据集。具体来说，我们提出了NESTFUL基准评估，用于评估LLM在API调用的嵌套序列方面的表现，即一个API调用的输出作为后续调用的输入。NESTFUL包含超过1800个嵌套序列，其中所有函数调用都是可执行的。在多模型和设置上的实验结果表明，该数据集上表现最佳的模型的完整序列匹配准确率为25%，胜率为34%，说明在函数调用中的嵌套序列方面还有很大的改进空间。我们对这些结果的分析为社区提供了可能的未来研究方向，此外还提供了一个基准评估来跟踪进度。我们已在Apache 2.0许可证下发布了NESTFUL数据集：https://github.com/IBM/NESTFUL。

论文及项目相关链接

PDF

Summary

本文关注大型语言模型（LLM）在解决复杂现实任务中的工具调用能力。针对现有评估基准和数据集在任务复杂性方面的不足，提出了一种新的评估基准NESTFUL，用于评估LLM在处理嵌套序列API调用方面的性能。实验结果表明，现有模型在这方面仍有待提高，为未来研究提供了方向。同时，本文也分享了NESTFUL数据集以便跟踪进展。

Key Takeaways

大型语言模型（LLM）在解决复杂现实任务中需要调用工具、API和自定义函数。
NESTFUL是一个用于评估LLM在处理嵌套序列API调用性能的基准测试。
NESTFUL包含1800多个可执行的嵌套序列。
目前最佳模型在NESTFUL数据集上的全序列匹配准确率为25%，胜率为34%，表明在这方面仍有很大改进空间。
本文分析了实验结果，为未来的研究提供了方向。
NESTFUL数据集已经发布，可供研究社区使用以跟踪进展。

Cool Papers

点此查看论文截图

PhishAgent: A Robust Multimodal Agent for Phishing Webpage Detection

Authors:Tri Cao, Chengyu Huang, Yuexin Li, Huilin Wang, Amy He, Nay Oo, Bryan Hooi

Phishing attacks are a major threat to online security, exploiting user vulnerabilities to steal sensitive information. Various methods have been developed to counteract phishing, each with varying levels of accuracy, but they also face notable limitations. In this study, we introduce PhishAgent, a multimodal agent that combines a wide range of tools, integrating both online and offline knowledge bases with Multimodal Large Language Models (MLLMs). This combination leads to broader brand coverage, which enhances brand recognition and recall. Furthermore, we propose a multimodal information retrieval framework designed to extract the relevant top k items from offline knowledge bases, using available information from a webpage, including logos and HTML. Our empirical results, based on three real-world datasets, demonstrate that the proposed framework significantly enhances detection accuracy and reduces both false positives and false negatives, while maintaining model efficiency. Additionally, PhishAgent shows strong resilience against various types of adversarial attacks.

网络钓鱼攻击是对网络安全的主要威胁之一，它通过利用用户漏洞来窃取敏感信息。尽管已经开发了各种方法来对抗网络钓鱼攻击，每种方法都有不同级别的准确性，但它们也面临着明显的局限性。本研究介绍了PhishAgent，这是一个多模式代理，它结合了多种工具，融合了在线和离线知识库与多模式大型语言模型（MLLMs）。这种结合导致了更广泛的品牌覆盖，从而提高了品牌识别和回忆。此外，我们提出了一种多模式信息检索框架，旨在从离线知识库中提取与网页上可用的信息相关的前k个项目，包括标志和HTML。我们的基于三个真实数据集的实证结果表明，该框架大大提高了检测准确性，降低了误报和漏报的数量，同时保持了模型的效率。此外，PhishAgent对各种对抗性攻击表现出强大的抵抗力。

论文及项目相关链接

PDF Accepted at AAAI 2025 (Oral)

Summary
自动化防护受到网络安全领域各种钓邮攻击的威胁影响深远。此研究引入PhishAgent，结合多种工具和多模态大型语言模型（MLLMs），提高品牌覆盖范围和识别能力，更有效地检测和防护钓邮攻击。实证结果表明，该框架显著提高检测准确性并减少误报和漏报，同时保持模型效率。此外，PhishAgent对抗多种对抗性攻击表现出强大的韧性。

Key Takeaways

Phishing攻击是网络安全的主要威胁，旨在利用用户漏洞窃取敏感信息。
当前反钓邮方法存在局限性，需要更有效的解决方案。
PhishAgent是一个多模态agent，结合在线和离线知识库与多模态大型语言模型（MLLMs）。
结合多种工具增强品牌覆盖范围和识别能力。
多模态信息检索框架从离线知识库中提取最相关的前k项信息。
基于三个真实数据集进行的实证研究证明PhishAgent能提高检测准确性并减少误报和漏报。

Cool Papers

点此查看论文截图

RegMix: Data Mixture as Regression for Language Model Pre-training

Authors:Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin

The data mixture for large language model pre-training significantly impacts performance, yet how to determine an effective mixture remains unclear. We propose RegMix to automatically identify a high-performing data mixture by formulating it as a regression task. RegMix trains many small models on diverse data mixtures, uses regression to predict performance of unseen mixtures, and applies the best predicted mixture to train a large-scale model with orders of magnitude more compute. To empirically validate RegMix, we train 512 models with 1M parameters for 1B tokens to fit the regression model and predict the best data mixture. Using this mixture we train a 1B parameter model for 25B tokens (i.e. 1000x larger and 25x longer) which we find performs best among 64 candidate 1B parameter models with other mixtures. Furthermore, RegMix consistently outperforms human selection in experiments involving models up to 7B models trained on 100B tokens, while matching or exceeding DoReMi using just 10% of the computational resources. Our experiments also show that (1) Data mixtures significantly impact performance; (2) Web corpora rather than data perceived as high-quality like Wikipedia have the strongest positive correlation with downstream performance; (3) Domains interact in complex ways often contradicting common sense, thus automatic approaches like RegMix are needed; (4) Data mixture effects transcend scaling laws. Our code is available at https://github.com/sail-sg/regmix.

数据混合对于大型语言模型的预训练性能有着显著影响，然而如何确定有效的数据混合仍不明确。我们提出RegMix方法，通过将其制定为回归任务来自动确定高性能的数据混合。RegMix在多种数据混合上训练许多小型模型，使用回归来预测未见过的数据混合的性能，并将最佳的预测数据混合应用于训练具有大量计算资源的大型模型。为了实证验证RegMix，我们训练了512个具有1M参数、训练标记为1B的模型来拟合回归模型并预测最佳数据混合。使用这种数据混合，我们训练了一个具有1B参数的模型，进行了高达25B标记的训练（即比具有其他混合物的其他最佳模型规模大10倍，训练时间为其三倍），并且我们发现该模型性能最佳。此外，RegMix在实验中的表现一直优于人工选择方式，即使在涉及高达拥有被训练标记为高达数十亿的更大模型的实验中也是如此。同时，我们的方法匹配或超过了DoReMi方法所使用的计算资源只有其十分之一。我们的实验还表明：（1）数据混合显著影响性能；（2）相较于被认为高质量的维基百科，网络语料库与下游任务的性能之间存在最强的正向关联；（3）各个领域之间的互动错综复杂常常有悖常理常识因此需要进行像RegMix一样的自动化方式来进行评估；（4）数据混合的影响超越了规模定律。我们的代码可在https://github.com/sail-sg/regmix上获取查阅。

论文及项目相关链接

PDF ICLR 2025

Summary

本文探讨了大型语言模型预训练数据混合的重要性及其对模型性能的影响。针对如何确定有效数据混合的问题，提出了一种名为RegMix的自动方法。RegMix通过回归任务来预测不同数据混合的性能，并使用最佳预测结果来训练大型模型。实证研究表明，RegMix在多个实验中表现优秀，与人工选择相比具有显著优势，同时匹配或超越了DoReMi方法且仅使用了其十分之一的计算资源。此外，实验还揭示了一些重要见解，如数据混合对性能有重要影响，网络语料库相较于如维基百科等高质量数据对下游性能有更强正面影响等。

Key Takeaways

数据混合对大型语言模型的性能具有显著影响。
RegMix方法能够自动确定高性能的数据混合。
通过回归任务，RegMix预测不同数据混合的性能表现。
网络语料库与下游性能之间有强正相关关系。
模型领域间的相互作用复杂，需要自动方法如RegMix来处理。
数据混合的影响超越了规模定律。

Cool Papers

点此查看论文截图

Evaluating LLMs for Quotation Attribution in Literary Texts: A Case Study of LLaMa3

Authors:Gaspard Michel, Elena V. Epure, Romain Hennequin, Christophe Cerisara

Large Language Models (LLMs) have shown promising results in a variety of literary tasks, often using complex memorized details of narration and fictional characters. In this work, we evaluate the ability of Llama-3 at attributing utterances of direct-speech to their speaker in novels. The LLM shows impressive results on a corpus of 28 novels, surpassing published results with ChatGPT and encoder-based baselines by a large margin. We then validate these results by assessing the impact of book memorization and annotation contamination. We found that these types of memorization do not explain the large performance gain, making Llama-3 the new state-of-the-art for quotation attribution in English literature. We release publicly our code and data.

大型语言模型（LLM）在各种文学任务中表现出了令人瞩目的结果，通常利用对叙事和虚构角色的复杂记忆细节。在这项工作中，我们评估了Llama-3在小说中将直接言语归于说话人的能力。该语言模型在28部小说的语料库上表现出令人印象深刻的结果，大幅度超越了ChatGPT和基于编码器的基线模型的已发布结果。然后，我们通过评估书籍记忆和注释污染的影响来验证这些结果。我们发现，这种类型的记忆并不能解释性能的大幅提升，这使得Llama-3成为英语文学中引语归属的新技术顶尖水平。我们公开发布了我们的代码和数据。

论文及项目相关链接

PDF NAACL 2025 Main Conference – short paper

Summary

LLMs如Llama-3在小说直接引语的说话者归属任务中展现出卓越性能，对28部小说的语料库结果令人印象深刻，超越了ChatGPT和基于编码器的基线模型。验证结果显示，书本记忆和标注污染并不足以解释其性能的大幅提升，Llama-3成为英语文学引用归属任务的新标杆。

Key Takeaways

Llama-3在小说直接引语的说话者归属任务中表现出卓越性能。
Llama-3在28部小说的语料库上的表现超越了ChatGPT和其他基线模型。
验证结果显示书本记忆和标注污染并不足以解释Llama-3的性能提升。
Llama-3成为英语文学引用归属任务的新标杆。
该研究公开了代码和数据，有助于进一步研究和改进LLM技术。
该研究展示了LLM在文学任务中的复杂记忆能力，包括叙事和角色细节的复杂记忆。

Cool Papers

点此查看论文截图

IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models

Authors:David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Jian Yun Zhuang, Jesujoba O. Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee, Chiamaka Chukwuneke, Happy Buzaaba, Blessing Sibanda, Godson Kalipe, Jonathan Mukiibi, Salomon Kabongo, Foutse Yuehgoh, Mmasibidi Setaka, Lolwethu Ndolela, Nkiruka Odu, Rooweither Mabuya, Shamsuddeen Hassan Muhammad, Salomey Osei, Sokhar Samb, Tadesse Kebede Guge, Tombekai Vangoni Sherman, Pontus Stenetorp

Despite the widespread adoption of Large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (\eg African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench – a human-translated benchmark dataset for 17 typologically-diverse low-resource African languages covering three tasks: natural language inference~~(AfriXNLI), mathematical reasoning~~(AfriMGSM), and multi-choice knowledge-based question answering~~(AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings~~(where test sets are translated into English) across 10 open and six proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages~(such as English and French) and low-resource African languages. We observe a significant performance gap between open and proprietary models, with the highest performing open model, Gemma 2 27B only at 63% of the best-performing proprietary model GPT-4o performance. In addition, machine translating the test set to English before evaluation helped to close the gap for larger models that are English-centric, such as Gemma 2 27B and LLaMa 3.1 70B. These findings suggest that more efforts are needed to develop and adapt LLMs for African languages.

尽管大型语言模型（LLM）得到了广泛应用，但它们的出色功能仅限于少数高资源语言。此外，由于高资源语言之外缺乏适当或全面的基准测试，许多低资源语言（例如非洲语言）通常仅在基本的文本分类任务上进行评估。在本文中，我们介绍了IrokoBench——一个为17种类型多样且资源匮乏的非洲语言提供的人类翻译基准数据集，包含三个任务：自然语言推理（AfriXNLI）、数学推理（AfriMGSM）和基于多选知识的问答（AfriMMLU）。我们使用IrokoBench对零样本、少样本以及将测试集翻译至英语的翻译测试环境进行了评估，涉及10个开源和六个专有LLM。我们的评估揭示了高资源语言（如英语和法语）与资源匮乏的非洲语言之间的显著性能差距。我们还观察到开源模型和专有模型之间的显著性能差距，表现最佳的开源模型Gemma 2 27B仅达到最佳专有模型GPT-4o性能的63%。此外，在评估之前将测试集机器翻译至英语有助于缩小以英语为中心的较大模型的差距，如Gemma 2 27B和LLaMa 3.1 70B。这些发现表明，需要更多的努力来开发和适应非洲语言的LLM。

论文及项目相关链接

PDF Accepted to NAACL 2025 (main conference)

Summary

本文介绍了IrokoBench——一个为17种语言形态各异的低资源非洲语言设计的人类翻译基准数据集。该数据集涵盖了三项任务：自然语言推理、数学推理和基于知识的问题回答。通过对零样本、少样本和翻译测试环境的评估，发现非洲语言与资源丰富的语言之间存在显著的性能差距。机器翻译测试集到英语后评估有助于缩小与英语为中心的大型模型的差距。这显示出需要更多努力来开发和适应非洲语言的LLM。

Key Takeaways

IrokoBench是一个针对非洲语言的基准数据集，包含自然语言推理、数学推理和知识问答三项任务。
数据集涵盖17种语言形态各异的低资源非洲语言。
在零样本、少样本和翻译测试环境中评估LLM性能时发现非洲语言与资源丰富的语言之间存在显著差距。
开放模型和专有模型之间存在显著性能差距，最佳开放模型仅达到最佳专有模型性能的63%。
将测试集机器翻译到英语后评估有助于缩小与英语为中心的大型模型的性能差距。
需要更多努力来开发和适应非洲语言的LLM。

Cool Papers

点此查看论文截图

Can LLMs Solve longer Math Word Problems Better?

Authors:Xin Xu, Tong Xiao, Zitong Chao, Zhenya Huang, Can Yang, Yang Wang

Math Word Problems (MWPs) play a vital role in assessing the capabilities of Large Language Models (LLMs), yet current research primarily focuses on questions with concise contexts. The impact of longer contexts on mathematical reasoning remains under-explored. This study pioneers the investigation of Context Length Generalizability (CoLeG), which refers to the ability of LLMs to solve MWPs with extended narratives. We introduce Extended Grade-School Math (E-GSM), a collection of MWPs featuring lengthy narratives, and propose two novel metrics to evaluate the efficacy and resilience of LLMs in tackling these problems. Our analysis of existing zero-shot prompting techniques with proprietary LLMs along with open-source LLMs reveals a general deficiency in CoLeG. To alleviate these issues, we propose tailored approaches for different categories of LLMs. For proprietary LLMs, we introduce a new instructional prompt designed to mitigate the impact of long contexts. For open-source LLMs, we develop a novel auxiliary task for fine-tuning to enhance CoLeG. Our comprehensive results demonstrate the effectiveness of our proposed methods, showing improved performance on E-GSM. Additionally, we conduct an in-depth analysis to differentiate the effects of semantic understanding and reasoning efficacy, showing that our methods improves the latter. We also establish the generalizability of our methods across several other MWP benchmarks. Our findings highlight the limitations of current LLMs and offer practical solutions correspondingly, paving the way for further exploration of model generalizability and training methodologies.

数学文字题（MWPs）在评估大型语言模型（LLM）的能力方面起着至关重要的作用，然而当前的研究主要集中在简洁语境的问题上。较长语境对数学推理的影响仍被忽视。本研究率先探讨了语境长度泛化（CoLeG）问题，即LLM解决具有扩展叙述的MWPs的能力。我们引入了扩展小学数学（E-GSM），这是一系列带有冗长叙述的MWPs，并提出了两种新颖的指标来评估LLM在处理这些问题时的效果和韧性。我们对现有的零样本提示技术与专有LLM以及开源LLM的分析表明，CoLeG普遍存在不足。为了缓解这些问题，我们为不同类型的LLM提出了针对性的方法。对于专有LLM，我们引入了一种新的指令提示，旨在减轻长语境的影响。对于开源LLM，我们开发了一种用于微调的新型辅助任务，以提高CoLeG。我们的全面结果证明了所提出方法的有效性，显示出在E-GSM上的性能改进。此外，我们进行了深入的分析，以区分语义理解和推理效果的影响，表明我们的方法提高了后者的效果。我们的方法在其他数学文字题基准测试中的通用性也得到了证实。我们的研究突出了当前LLM的局限性，并提供了相应的实用解决方案，为模型的泛化能力和训练方法的进一步探索铺平了道路。

论文及项目相关链接

PDF Accepted to ICLR 2025

摘要

本文研究了大型语言模型（LLM）在处理带有较长语境的数学字问题（MWPs）时的表现。文章介绍了语境长度泛化（CoLeG）的概念，并探讨了LLM解决具有扩展叙事的问题的能力。文章引入了一种新的数学字问题集Extended Grade-School Math（E-GSM），并提出了两种新指标来评估LLM解决这些问题的效果和韧性。文章分析了现有的零提示提示技术与开源LLM和专有LLM的局限性，并提出了针对性的解决方案。对于专有LLM，引入了一种新的指令提示来缓解长语境的影响。对于开源LLM，我们开发了一种新颖的精细调整辅助任务来增强CoLeG。实验结果显示，所提出的方法在E-GSM上表现出良好的性能提升。此外，文章还深入分析了语义理解和推理效果的不同影响，并建立了方法在其他数学字问题基准测试上的泛化能力。

关键见解

当前研究主要关注具有简洁语境的数学字问题，而较长语境对数学推理的影响尚未得到充分探索。
引入Context Length Generalizability（CoLeG）概念，以评估LLM解决具有扩展叙事的数学字问题的能力。
提出Extended Grade-School Math（E-GSM）数据集，包含具有较长叙事的问题。
介绍两种新指标来评估LLM在解决这些问题时的效果和韧性。
分析现有零提示提示技术的局限性，并针对专有LLM和开源LLM提出解决方案。
对于专有LLM，引入新的指令提示来缓解长语境的影响；对于开源LLM，开发新颖的精细调整辅助任务以增强CoLeG。
实验结果显示所提出方法的有效性，并深入分析了语义理解和推理效果的不同影响。

Cool Papers

点此查看论文截图

TT-BLIP: Enhancing Fake News Detection Using BLIP and Tri-Transformer

Authors:Eunjee Choi, Jong-Kook Kim

Detecting fake news has received a lot of attention. Many previous methods concatenate independently encoded unimodal data, ignoring the benefits of integrated multimodal information. Also, the absence of specialized feature extraction for text and images further limits these methods. This paper introduces an end-to-end model called TT-BLIP that applies the bootstrapping language-image pretraining for unified vision-language understanding and generation (BLIP) for three types of information: BERT and BLIPTxt for text, ResNet and BLIPImg for images, and bidirectional BLIP encoders for multimodal information. The Multimodal Tri-Transformer fuses tri-modal features using three types of multi-head attention mechanisms, ensuring integrated modalities for enhanced representations and improved multimodal data analysis. The experiments are performed using two fake news datasets, Weibo and Gossipcop. The results indicate TT-BLIP outperforms the state-of-the-art models.

检测假新闻已经引起了广泛关注。许多以前的方法会连接独立编码的单模态数据，忽略了集成多模态信息的好处。此外，缺少针对文本和图像的专业特征提取进一步限制了这些方法。本文介绍了一种端到端的模型，称为TT-BLIP，该模型应用引导式语言图像预训练进行统一的视觉语言理解和生成（BLIP），涵盖三种类型的信息：BERT和BLIPTxt用于文本，ResNet和BLIPImg用于图像，双向BLIP编码器用于多模态信息。多模态三Transformer通过三种多头注意力机制融合三模态特征，确保集成模式用于增强表示和改进多模态数据分析。实验采用两个假新闻数据集微博和Gossipcop进行。结果表明，TT-BLIP优于最新模型。

论文及项目相关链接

PDF 8 pages, Accepted 27th International Conference on Information Fusion, FUSION 2024

Summary：

本文介绍了一种名为TT-BLIP的端到端模型，该模型采用预训练的跨语言图像理解生成（BLIP）技术，结合文本和图像的多模态信息，进行统一的视觉和语言理解。模型通过三种信息类型（文本、图像和多模态信息）和三种多头注意力机制融合三模态特征，提高了多模态数据的表示和分析能力。实验结果表明，TT-BLIP在假新闻检测方面优于现有技术。

Key Takeaways:

许多早期假新闻检测方法忽略多模态信息的融合优势，仅连接独立编码的单模态数据。
TT-BLIP模型采用预训练的BLIP技术，支持统一视觉和语言理解生成。
TT-BLIP支持三种信息类型：文本（BERT和BLIPTxt）、图像（ResNet和BLIPImg）以及多模态信息。
多模态三模态转换器通过三种多头注意力机制融合三模态特征。
模型提高了多模态数据的表示和分析能力。
实验在Weibo和Gossipcop两个假新闻数据集上进行，验证了TT-BLIP模型的性能优越性。

Cool Papers

点此查看论文截图

Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach

Authors:André O. Françani, Marcos R. O. A. Maximo

Estimating the camera’s pose given images from a single camera is a traditional task in mobile robots and autonomous vehicles. This problem is called monocular visual odometry and often relies on geometric approaches that require considerable engineering effort for a specific scenario. Deep learning methods have been shown to be generalizable after proper training and with a large amount of available data. Transformer-based architectures have dominated the state-of-the-art in natural language processing and computer vision tasks, such as image and video understanding. In this work, we deal with the monocular visual odometry as a video understanding task to estimate the 6 degrees of freedom of a camera’s pose. We contribute by presenting the TSformer-VO model based on spatio-temporal self-attention mechanisms to extract features from clips and estimate the motions in an end-to-end manner. Our approach achieved competitive state-of-the-art performance compared with geometry-based and deep learning-based methods on the KITTI visual odometry dataset, outperforming the DeepVO implementation highly accepted in the visual odometry community. The code is publicly available at https://github.com/aofrancani/TSformer-VO.

根据单相机图像估计相机的姿态是移动机器人和自动驾驶汽车中的一个传统任务。这个问题被称为单目视觉里程计，通常依赖于特定场景需要大量工程努力的几何方法。深度学习方法在适当的训练和大量可用数据之后，已显示出可推广性。基于Transformer的架构在诸如图像和视频理解等领域的自然语言处理和计算机视觉任务中占据了最先进的地位。在这项工作中，我们将单目视觉里程计作为视频理解任务来处理，以估计相机的六自由度姿态。我们提出了一种基于时空自注意力机制的TSformer-VO模型，以从片段中提取特征并以端到端的方式估计运动。我们的方法在KITTI视觉里程计数据集上与基于几何和基于深度学习的方法相比，取得了具有竞争力的最先进的性能，并超越了视觉里程计社区中广受欢迎的DeepVO实现。代码可在https://github.com/aofrancani/TSformer-VO上公开访问。

论文及项目相关链接

PDF This work has been accepted for publication in IEEE Access

Summary

本文介绍了基于Transformer架构的TSformer-VO模型在单目视觉里程计任务中的应用。该模型利用时空自注意力机制提取视频片段特征，以端到端的方式估计相机姿态的六个自由度。在KITTI视觉里程计数据集上的性能表现优异，超过了基于几何和深度学习的方法，包括被广泛接受的DeepVO实现。

Key Takeaways

TSformer-VO模型基于Transformer架构，用于处理单目视觉里程计任务。
该模型利用时空自注意力机制提取视频片段特征。
TSformer-VO以端到端的方式估计相机姿态的六个自由度。
在KITTI视觉里程计数据集上，TSformer-VO性能表现优异，超过了基于几何和深度学习的方法。
TSformer-VO模型公开可用，网址为https://github.com/aofrancani/TSformer-VO。
该研究展示了深度学习方法的通用性，经过适当训练和大量数据，可以取得良好性能。

Cool Papers

点此查看论文截图

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-01-25/LLM/

本博客所有文章除特別声明外，均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !

LLM

Agent

Agent 方向最新论文已更新，请持续关注 Update in 2025-01-25 Scalable Safe Multi-Agent Reinforcement Learning for Multi-Agent System

2025-01-25 Agent

Agent

Interactive

Interactive 方向最新论文已更新，请持续关注 Update in 2025-01-24 Training Dialogue Systems by AI Feedback for Improving Overall Dialogue Impression

2025-01-24 Interactive

Interactive

LLM

2025-01-25 更新

CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation

Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models

Exploring Finetuned Audio-LLM on Heart Murmur Features

Predicting Compact Phrasal Rewrites with Large Language Models for ASR Post Editing

Enhancing LLMs for Governance with Human Oversight: Evaluating and Aligning LLMs on Expert Classification of Climate Misinformation for Detecting False or Misleading Claims about Climate Change

Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak

UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models

RAMQA: A Unified Framework for Retrieval-Augmented Multi-Modal Question Answering

ARTEMIS-DA: An Advanced Reasoning and Transformation Engine for Multi-Step Insight Synthesis in Data Analytics

Interactive Cycle Model: The Linkage Combination among Automatic Speech Recognition, Large Language Models and Smart Glasses

Reducing Reasoning Costs: The Path of Optimization for Chain of Thought via Sparse Attention Mechanism

DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation

NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls

PhishAgent: A Robust Multimodal Agent for Phishing Webpage Detection

RegMix: Data Mixture as Regression for Language Model Pre-training

Evaluating LLMs for Quotation Attribution in Literary Texts: A Case Study of LLaMa3

IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models

Can LLMs Solve longer Math Word Problems Better?

TT-BLIP: Enhancing Fake News Detection Using BLIP and Tri-Transformer

Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach