LLM

发布日期: 2025-01-06

更新日期: 2025-05-14

文章字数: 17.1k

阅读时长: 70 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-01-06 更新

Unifying Specialized Visual Encoders for Video Language Models

Authors:Jihoon Chung, Tyler Zhu, Max Gonzalez Saez-Diez, Juan Carlos Niebles, Honglu Zhou, Olga Russakovsky

The recent advent of Large Language Models (LLMs) has ushered sophisticated reasoning capabilities into the realm of video through Video Large Language Models (VideoLLMs). However, VideoLLMs currently rely on a single vision encoder for all of their visual processing, which limits the amount and type of visual information that can be conveyed to the LLM. Our method, MERV, Multi-Encoder Representation of Videos, instead leverages multiple frozen visual encoders to create a unified representation of a video, providing the VideoLLM with a comprehensive set of specialized visual knowledge. Spatio-temporally aligning the features from each encoder allows us to tackle a wider range of open-ended and multiple-choice video understanding questions and outperform prior state-of-the-art works. MERV is up to 3.7% better in accuracy than Video-LLaVA across the standard suite video understanding benchmarks, while also having a better Video-ChatGPT score. We also improve upon SeViLA, the previous best on zero-shot Perception Test accuracy, by 2.2%. MERV introduces minimal extra parameters and trains faster than equivalent single-encoder methods while parallelizing the visual processing. Finally, we provide qualitative evidence that MERV successfully captures domain knowledge from each of its encoders. Our results offer promising directions in utilizing multiple vision encoders for comprehensive video understanding.

随着大型语言模型（LLM）的最近出现，通过视频大型语言模型（VideoLLM）将复杂的推理能力引入了视频领域。然而，VideoLLM目前依赖于单一视觉编码器进行所有视觉处理，这限制了可以传达给LLM的视觉信息量和类型。我们的方法，MERV（视频的多元编码器表示），相反地利用多个冻结的视觉编码器来创建视频的统一表示，为VideoLLM提供一套全面的专业视觉知识。通过时空对齐每个编码器的特征，我们能够解决更广泛的开放式和选择性视频理解问题，并超越先前的最新技术水平。在标准视频理解基准测试中，MERV的准确度比Video-LLaVA高出3.7%，同时拥有更高的Video-ChatGPT分数。我们还提高了零样本感知测试准确性的前任最佳SeViLA的准确度，提高了2.2%。MERV引入了极少的额外参数，并且相对于单编码器方法训练速度更快，同时并行处理视觉。最后，我们提供了定性证据表明MERV成功地从每个编码器中捕获了领域知识。我们的研究结果提供了利用多个视觉编码器进行全面视频理解的充满希望的方向。

论文及项目相关链接

PDF Project page: https://tylerzhu.com/merv/

Summary

大型语言模型（LLMs）通过视频大型语言模型（VideoLLMs）为视频领域带来了先进的推理能力。然而，现有VideoLLMs仅依赖单一视觉编码器进行所有视觉处理，限制了能够传达给LLM的视觉信息量和类型。我们的方法MERV（Multi-Encoder Representation of Videos）采用多个冻结的视觉编码器来创建视频的统一表示，为VideoLLM提供专门的视觉知识集。通过时空对齐每个编码器的特征，我们能够解决更广泛的开放问题和多选择题型的视频理解问题，并超越了现有最先进的作品。MERV在标准视频理解基准测试上的准确率比Video-LLaVA高出3.7%，同时拥有更高的Video-ChatGPT分数。我们还提高了零样本感知测试准确性的先前最佳成绩SeViLA，提高了2.2%。MERV在增加少量额外参数的同时，实现了比单编码器方法更快的训练速度，并行进行视觉处理。最后，我们提供定性证据表明MERV成功捕获了每个编码器的领域知识。我们的研究结果表明，利用多个视觉编码器进行视频理解的潜力巨大。

Key Takeaways

VideoLLMs目前依赖于单一视觉编码器，限制了视频理解的深度和广度。
MERV方法采用多个冻结的视觉编码器来处理视频，为VideoLLM提供更全面的视觉知识。
MERV通过时空对齐特征，提高了对开放问题和多选择题型的视频理解问题的处理能力。
MERV在标准视频理解基准测试上的准确率优于现有方法。
MERV在零样本感知测试准确性方面取得了显著改进。
MERV具有较少的额外参数，训练速度快，可并行处理视觉任务。

Cool Papers

点此查看论文截图

OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios

Authors:Xize Cheng, Dongjie Fu, Xiaoda Yang, Minghui Fang, Ruofan Hu, Jingyu Lu, Bai Jionghao, Zehan Wang, Shengpeng Ji, Rongjie Huang, Linjun Li, Yu Chen, Tao Jin, Zhou Zhao

With the rapid development of large language models, researchers have created increasingly advanced spoken dialogue systems that can naturally converse with humans. However, these systems still struggle to handle the full complexity of real-world conversations, including audio events, musical contexts, and emotional expressions, mainly because current dialogue datasets are constrained in both scale and scenario diversity. In this paper, we propose leveraging synthetic data to enhance the dialogue models across diverse scenarios. We introduce ShareChatX, the first comprehensive, large-scale dataset for spoken dialogue that spans diverse scenarios. Based on this dataset, we introduce OmniChat, a multi-turn dialogue system with a heterogeneous feature fusion module, designed to optimize feature selection in different dialogue contexts. In addition, we explored critical aspects of training dialogue systems using synthetic data. Through comprehensive experimentation, we determined the ideal balance between synthetic and real data, achieving state-of-the-art results on the real-world dialogue dataset DailyTalk. We also highlight the crucial importance of synthetic data in tackling diverse, complex dialogue scenarios, especially those involving audio and music. For more details, please visit our demo page at \url{https://sharechatx.github.io/}.

随着大型语言模型的快速发展，研究者已经开发出了越来越先进的对话系统，可以自然地与人类进行对话。然而，这些系统在处理真实世界的对话的复杂性方面仍然面临挑战，包括音频事件、音乐背景和情绪表达，这主要是因为当前的对话数据集在规模和场景多样性方面存在限制。在本文中，我们提出利用合成数据来增强对话模型在不同场景中的应用。我们推出了ShareChatX，这是首个全面、大规模的对话数据集，涵盖各种场景。基于此数据集，我们推出了OmniChat，这是一个多轮对话系统，配备有异质特征融合模块，旨在优化不同对话语境中的特征选择。此外，我们还探讨了使用合成数据训练对话系统的关键方面。通过全面的实验，我们确定了合成数据与真实数据之间的理想平衡，并在真实世界的对话数据集DailyTalk上取得了最新的结果。我们还强调了合成数据在处理多样、复杂的对话场景中的关键作用，特别是涉及音频和音乐的场景。如需更多详细信息，请访问我们的演示页面：[https://sharechatx.github.io/]。

论文及项目相关链接

PDF

Summary
本文介绍了大型语言模型的快速发展，并指出当前对话系统处理真实世界对话的复杂性时面临的挑战。为了解决这个问题，作者提出了利用合成数据增强对话模型的方法，并介绍了ShareChatX数据集和OmniChat多轮对话系统。实验表明，合成数据和真实数据的平衡可以取得最佳效果，并在真实世界对话数据集DailyTalk上达到最新水平。合成数据在应对涉及音频和音乐的复杂对话场景方面尤其重要。

Key Takeaways

当前对话系统面临处理真实世界对话复杂性的挑战，尤其是音频事件、音乐背景和情感表达方面。
ShareChatX数据集是首个全面、大规模的对话数据集，涵盖多种场景。
OmniChat是一个多轮对话系统，具有异质特征融合模块，旨在优化不同对话语境中的特征选择。
合成数据在训练对话系统中具有重要作用。
合成数据和真实数据的平衡对于取得最佳实验效果至关重要。
在真实世界对话数据集DailyTalk上取得了最新水平的效果。

Cool Papers

点此查看论文截图

Aligning Large Language Models for Faithful Integrity Against Opposing Argument

Authors:Yong Zhao, Yang Deng, See-Kiong Ng, Tat-Seng Chua

Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks. However, they can be easily misled by unfaithful arguments during conversations, even when their original statements are correct. To this end, we investigate the problem of maintaining faithful integrity in LLMs. This involves ensuring that LLMs adhere to their faithful statements in the face of opposing arguments and are able to correct their incorrect statements when presented with faithful arguments. In this work, we propose a novel framework, named Alignment for Faithful Integrity with Confidence Estimation (AFICE), which aims to align the LLM responses with faithful integrity. Specifically, AFICE first designs a Bilateral Confidence Estimation (BCE) approach for estimating the uncertainty of each response generated by the LLM given a specific context, which simultaneously estimate the model’s confidence to the question based on the internal states during decoding as well as to the answer based on cumulative probability ratios. With the BCE, we construct a conversational preference dataset composed of context, original statement, and argument, which is adopted for aligning the LLM for faithful integrity using Direct Preference Optimization (DPO). Extensive experimental results on a wide range of benchmarks demonstrate significant improvements in the LLM’s ability to maintain faithful responses when encountering opposing arguments, ensuring both the practical utility and trustworthiness of LLMs in complex interactive settings. Code and data will be released via https://github.com/zhaoy777/AFICE.git

大型语言模型（LLM）在复杂的推理任务中展现出了令人印象深刻的能力。然而，它们在对话中很容易受到不真实论据的误导，即使它们的原始陈述是正确的。为此，我们研究了在LLM中保持忠实性的问题。这涉及到确保LLM在面对相反的论点时坚持其忠实的陈述，并在面对真实的论据时能够纠正其不正确的陈述。在这项工作中，我们提出了一个名为“基于信心估计的忠实性对齐”（AFICE）的新型框架，旨在使LLM的回应与忠实性保持一致。具体来说，AFICE首先设计了一种双向信心估计（BCE）方法，用于估计LLM在给定的特定上下文中所生成的每个回应的不确定性，该方法同时基于解码过程中的内部状态估计模型对问题的信心以及对答案的信心是基于累积概率比的。通过BCE，我们构建了一个包含上下文、原始陈述和论据的对话偏好数据集，该数据集采用直接偏好优化（DPO）来对齐LLM的忠实性。在广泛的标准测试上的大量实验结果表明，在面临相反的论点时，LLM在保持忠实回应的能力方面取得了显著的改进，确保了LLM在复杂的交互式环境中的实用性和可信度。代码和数据将通过https://github.com/zhaoy777/AFICE.git发布。

论文及项目相关链接

PDF 17 pages, 5 figures

Summary
大型语言模型（LLM）在复杂推理任务中表现出强大的能力，但在对话中容易受不真实论据的影响而偏离原有正确观点。为解决这个问题，研究者提出了名为AFICE的新框架，旨在提高LLM的忠实度并增强其在面对对立观点时的信心评估。AFICE通过设计双向信心估计（BCE）方法，评估LLM在特定语境下的回应不确定性，并基于解码过程中的内部状态和累积概率比率来评估模型对答案的信心。使用BCE构建了一个包含语境、原始陈述和论据的对话偏好数据集，并采用直接偏好优化（DPO）来对齐LLM的忠实度。在广泛的基准测试中，LLM在面临对立观点时，保持忠实回应的能力得到了显著提高，这确保了其在复杂交互环境中的实用性和可靠性。

Key Takeaways

LLM在复杂推理任务中表现出强大的能力，但易受不真实论据影响。
AFICE框架旨在提高LLM的忠实度并增强其信心评估。
AFICE通过BCE方法评估LLM回应的不确定性。
双向信心估计考虑了模型对问题和答案的信心。
使用BCE构建了一个对话偏好数据集，用于提高LLM的忠实度。
直接偏好优化（DPO）用于对齐LLM的忠实度。

Cool Papers

点此查看论文截图

CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models

Authors:Johan Wahréus, Ahmed Mohamed Hussain, Panos Papadimitratos

Numerous studies have investigated methods for jailbreaking Large Language Models (LLMs) to generate harmful content. Typically, these methods are evaluated using datasets of malicious prompts designed to bypass security policies established by LLM providers. However, the generally broad scope and open-ended nature of existing datasets can complicate the assessment of jailbreaking effectiveness, particularly in specific domains, notably cybersecurity. To address this issue, we present and publicly release CySecBench, a comprehensive dataset containing 12662 prompts specifically designed to evaluate jailbreaking techniques in the cybersecurity domain. The dataset is organized into 10 distinct attack-type categories, featuring close-ended prompts to enable a more consistent and accurate assessment of jailbreaking attempts. Furthermore, we detail our methodology for dataset generation and filtration, which can be adapted to create similar datasets in other domains. To demonstrate the utility of CySecBench, we propose and evaluate a jailbreaking approach based on prompt obfuscation. Our experimental results show that this method successfully elicits harmful content from commercial black-box LLMs, achieving Success Rates (SRs) of 65% with ChatGPT and 88% with Gemini; in contrast, Claude demonstrated greater resilience with a jailbreaking SR of 17%. Compared to existing benchmark approaches, our method shows superior performance, highlighting the value of domain-specific evaluation datasets for assessing LLM security measures. Moreover, when evaluated using prompts from a widely used dataset (i.e., AdvBench), it achieved an SR of 78.5%, higher than the state-of-the-art methods.

已有大量研究探讨了突破大型语言模型（LLM）生成有害内容的方法。通常，这些方法使用恶意提示数据集进行评估，这些数据集旨在绕过LLM提供商建立的安全策略。然而，现有数据集的广泛范围和开放性特点可能会使评估突破效果变得复杂，特别是在特定领域，如网络安全领域尤为如此。为了解决这个问题，我们推出并公开发布CySecBench数据集，该数据集包含专门用于评估网络安全领域突破技术的12662个提示。数据集分为10个独特的攻击类型类别，采用封闭式提示，以便更一致和准确地评估突破尝试。此外，我们详细介绍了数据集生成和过滤的方法，该方法可适应于在其他领域创建类似数据集。为了展示CySecBench的实用性，我们提出了一种基于提示模糊化的突破方法，并对其进行了评估和演示。实验结果表明，该方法成功地从商业黑箱LLM中引发了有害内容，使用ChatGPT的成功率为65%，使用Gemini的成功率为88%，而Claude表现出更强的抗突破性，成功率为17%。与现有的基准方法相比，我们的方法表现出更优越的性能，突显了针对特定领域的评估数据集在评估LLM安全措施方面的价值。此外，在使用广泛使用的数据集（即AdvBench）中的提示进行评估时，其成功率达到了78.5%，高于最新方法。

论文及项目相关链接

PDF

摘要

本文介绍了针对大型语言模型（LLM）破解方法的研究，研究人员创建并公开发布了CySecBench数据集，该数据集包含专门用于评估网络安全领域LLM破解技术的12662个提示。数据集分为10个不同的攻击类型类别，采用封闭式提示，以便更一致、准确地评估破解尝试。此外，本文详细介绍了数据集生成和过滤的方法论，该方法可以适应其他领域的类似数据集创建。为证明CySecBench的实用性，研究人员提出了一种基于提示混淆的破解方法，并进行了评估。实验结果表明，该方法成功诱导商业黑箱LLM生成有害内容，在ChatGPT上的成功率为65%，在Gemini上的成功率高达88%，而Claude显示出更强的抗破解性，成功率仅为17%。与现有的基准方法相比，该方法表现出卓越的性能，强调领域特定评估数据集在评估LLM安全措施方面的价值。使用广泛使用的数据集（如AdvBench）中的提示进行评估时，其成功率达到78.5%，高于现有最佳方法。

关键见解

介绍了专门用于评估网络安全领域LLM破解技术的CySecBench数据集，包含12662个特定提示。
数据集分为10个攻击类型类别，采用封闭式提示，以提高评估的一致性和准确性。
详细介绍数据集生成和过滤的方法论，可适应其他领域的类似数据集创建。
基于提示混淆的破解方法被提出并评估，表现出卓越的性能。
该方法在商业黑箱LLM上成功诱导生成有害内容，不同模型的成功率有所不同。
与现有方法相比，该方法在评估LLM安全措施方面表现出更高的价值。
使用其他数据集（如AdvBench）进行评估时，该方法仍表现出较高的成功率。

Cool Papers

点此查看论文截图

The Prompt Alchemist: Automated LLM-Tailored Prompt Optimization for Test Case Generation

Authors:Shuzheng Gao, Chaozheng Wang, Cuiyun Gao, Xiaoqian Jiao, Chun Yong Chong, Shan Gao, Michael Lyu

Test cases are essential for validating the reliability and quality of software applications. Recent studies have demonstrated the capability of Large Language Models (LLMs) to generate useful test cases for given source code. However, the existing work primarily relies on human-written plain prompts, which often leads to suboptimal results since the performance of LLMs can be highly influenced by the prompts. Moreover, these approaches use the same prompt for all LLMs, overlooking the fact that different LLMs might be best suited to different prompts. Given the wide variety of possible prompt formulations, automatically discovering the optimal prompt for each LLM presents a significant challenge. Although there are methods on automated prompt optimization in the natural language processing field, they are hard to produce effective prompts for the test case generation task. First, the methods iteratively optimize prompts by simply combining and mutating existing ones without proper guidance, resulting in prompts that lack diversity and tend to repeat the same errors in the generated test cases. Second, the prompts are generally lack of domain contextual knowledge, limiting LLMs’ performance in the task.

测试用例对于验证软件应用的可靠性和质量至关重要。最近的研究表明，大型语言模型（LLM）能够为给定的源代码生成有用的测试用例。然而，现有工作主要依赖于人工编写的普通提示，这往往会导致结果不尽如人意，因为语言模型的性能会受到提示的很大影响。此外，这些方法对所有语言模型使用相同的提示，忽略了不同语言模型可能最适合不同提示的事实。考虑到可能存在的各种提示形式，为每个语言模型自动发现最佳提示是一个巨大的挑战。虽然自然语言处理领域有自动提示优化的方法，但它们很难为测试用例生成任务生成有效的提示。首先，这些方法通过简单地组合和变异现有提示来迭代优化提示，而没有适当的指导，导致生成的提示缺乏多样性，并在生成的测试用例中重复相同的错误。其次，这些提示通常缺乏领域上下文知识，限制了语言模型在此任务中的性能。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）在生成源代码测试案例方面表现出巨大潜力，但现有研究主要依赖人工编写的提示，这往往导致结果不尽如人意。不同LLM可能需要不同的提示，自动发现每个LLM的最佳提示是一个挑战。当前自动优化提示的方法缺乏多样性、重复度高且缺乏领域上下文知识，限制了LLM在测试案例生成任务中的表现。

Key Takeaways

测试案例在验证软件应用的可靠性和质量方面至关重要。
LLM有能力生成有用的测试案例。
现有方法主要依赖人工编写的提示，导致结果可能不佳。
不同LLM可能需要不同的提示。
自动发现每个LLM的最佳提示是一个挑战。
当前自动优化提示的方法缺乏多样性，且容易重复错误。
当前方法缺乏领域上下文知识，限制了LLM在测试案例生成任务中的表现。

Cool Papers

点此查看论文截图

Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking

Authors:Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Ji-Rong Wen

Large language models (LLMs) demonstrate exceptional capabilities, yet still face the hallucination issue. Typical text generation approaches adopt an auto-regressive generation without deliberate reasoning, which often results in untrustworthy and factually inaccurate responses. In this paper, we propose HaluSearch, a novel framework that incorporates tree search-based algorithms (e.g. MCTS) to enable an explicit slow thinking generation process for mitigating hallucinations of LLMs during inference. Specifically, HaluSearch frames text generation as a step-by-step reasoning process, using a self-evaluation reward model to score each generation step and guide the tree search towards the most reliable generation pathway for fully exploiting the internal knowledge of LLMs. To balance efficiency and quality, we introduce a hierarchical thinking system switch mechanism inspired by the dual process theory in cognitive science, which dynamically alternates between fast and slow thinking modes at both the instance and step levels, adapting to the complexity of questions and reasoning states. We conduct extensive experiments on both English and Chinese datasets and the results show that our approach significantly outperforms baseline approaches.

大型语言模型（LLM）展现出卓越的能力，但仍面临虚构问题。典型的文本生成方法采用自回归生成方式，无需刻意推理，这往往导致生成不可信的、事实错误的回应。在本文中，我们提出了HaluSearch，这是一个结合树搜索算法（如MCTS）的新框架，通过明确的缓慢思考生成过程，减轻LLM在推理过程中的虚构现象。具体来说，HaluSearch将文本生成构建为一个逐步推理过程，使用自我评估奖励模型对每一步生成进行评分，并引导树搜索走向最可靠的生成路径，以充分利用LLM的内部知识。为了平衡效率和质量，我们引入了基于认知科学中的双过程理论的分层思维系统切换机制。该机制在实例和步骤层面动态切换快速和慢速思考模式，以适应问题的复杂性和推理状态。我们在英文和中文数据集上进行了大量实验，结果表明我们的方法显著优于基准方法。

论文及项目相关链接

PDF

Summary

LLM存在hallucination问题，本文提出一种结合树搜索算法（如MCTS）的新框架HaluSearch，实现显式慢思考生成过程，以减轻推理过程中的hallucination问题。HaluSearch将文本生成视为逐步推理过程，利用自评估奖励模型为每一步打分，引导树搜索走向最可靠的生成路径，实现LLM内部知识的充分利用。为提高效率和生成质量，引入分层思维系统切换机制，根据问题和推理状态的复杂性在快慢思维模式间动态切换。实验表明，该方法显著优于基线方法。

Key Takeaways

LLM面临hallucination问题，需要新的框架来解决。
HaluSearch框架结合树搜索算法，实现显式慢思考生成过程。
HaluSearch将文本生成视为逐步推理过程，利用自评估奖励模型引导树搜索。
分层思维系统切换机制根据问题和推理状态的复杂性动态切换思维模式。
HaluSearch在平衡效率和生成质量方面表现出优异性能。
实验证明，HaluSearch显著优于基线方法。

Cool Papers

点此查看论文截图

Citations and Trust in LLM Generated Responses

Authors:Yifan Ding, Matthew Facciani, Amrit Poudel, Ellen Joyce, Salvador Aguinaga, Balaji Veeramani, Sanmitra Bhattacharya, Tim Weninger

Question answering systems are rapidly advancing, but their opaque nature may impact user trust. We explored trust through an anti-monitoring framework, where trust is predicted to be correlated with presence of citations and inversely related to checking citations. We tested this hypothesis with a live question-answering experiment that presented text responses generated using a commercial Chatbot along with varying citations (zero, one, or five), both relevant and random, and recorded if participants checked the citations and their self-reported trust in the generated responses. We found a significant increase in trust when citations were present, a result that held true even when the citations were random; we also found a significant decrease in trust when participants checked the citations. These results highlight the importance of citations in enhancing trust in AI-generated content.

问答系统正在迅速发展，但其不透明的特性可能影响用户信任。我们通过一个反监控框架来探索信任，在该框架中，信任预计与引用存在正相关，与引用核查成负相关。我们通过实时问答实验测试了这一假设，实验中呈现了使用商业聊天机器人生成的文本答案，并附有不同数量的引用（零个、一个或五个），包括相关引用和随机引用。我们记录了参与者是否核查引用，以及他们对生成答案的自我报告信任程度。我们发现，当存在引用时，信任度显著增加，即使在引用是随机的情况下也是如此；我们还发现，当参与者检查引用时，信任度显著降低。这些结果突出了引用在提高人工智能生成内容信任度方面的重要性。

论文及项目相关链接

PDF Accepted to AAAI 2025

Summary：问答系统的快速发展带来了用户信任的问题。研究通过反监控框架探索信任，预测信任与引证的存在正相关，与引证查证行为成负相关。通过实时问答实验验证此假设，呈现由商业聊天机器人生成的文本答案，附带不同数量的引证（零、一或五个），既有相关性也有随机性。实验发现引证增加显著提高信任度，即使引证随机也不影响此结果；当参与者查证引证时，信任度显著降低。这突显了引证在增强对AI生成内容的信任中的重要性。

Key Takeaways：

问答系统发展快速，但用户信任成为关键问题。
研究通过反监控框架探索信任，发现信任与引证数量正相关。
实时问答实验发现，当答案附有引证时，用户对AI生成答案的信任度显著增加。
即使引证内容随机，对信任度的提升依然显著。
当用户查证引证时，他们对AI答案的信任度会显著降低。
引证的数量和质量对增强AI生成内容的信任度至关重要。

Cool Papers

点此查看论文截图

CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings

Authors:Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, Junyang Lin

With the increasing code reasoning capabilities of existing large language models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing need to develop more challenging and comprehensive benchmarks that effectively test their sophisticated competition-level coding abilities. Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments. To bridge this gap, we introduce CodeElo, a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time. CodeElo benchmark is mainly based on the official CodeForces platform and tries to align with the platform as much as possible. We compile the recent six months of contest problems on CodeForces with detailed information such as contest divisions, problem difficulty ratings, and problem algorithm tags. We introduce a unique judging method in which problems are submitted directly to the platform and develop a reliable Elo rating calculation system that aligns with the platform and is comparable with human participants but has lower variance. By testing on our CodeElo, we provide the Elo ratings of 30 existing popular open-source and 3 proprietary LLMs for the first time. The results show that o1-mini and QwQ-32B-Preview stand out significantly, achieving Elo ratings of 1578 and 1261, respectively, while other models struggle even with the easiest problems, placing in the lowest 20 percent among all human participants. Detailed analysis experiments are also conducted to provide insights into performance across algorithms and comparisons between using C++ and Python, which can suggest directions for future studies.

随着现有大型语言模型（LLM）的代码推理能力的增强，以及如OpenAI o1和o3等推理模型的突破，需要开发更具挑战性和全面的基准测试，以有效地测试它们的高级竞赛级编码能力。现有的基准测试，如LiveCodeBench和USACO，由于缺少私有测试用例、不支持特设评委以及执行环境不匹配等原因而显得不足。为了弥补这一差距，我们引入了CodeElo，这是一个标准化的竞赛级代码生成基准测试，首次有效地解决了所有这些挑战。CodeElo基准测试主要基于官方的CodeForces平台，并尽可能与平台保持一致。我们整理了CodeForces上近六个月的竞赛题目，包括竞赛分组、问题难度评级和问题算法标签等详细信息。我们采用了一种独特的评判方法，问题直接提交到平台，并开发了一个可靠的Elo评分计算系统，该系统与平台一致，可与人类参赛者进行比较，但具有较低的变化性。通过在我们的CodeElo上进行测试，我们首次提供了30个现有流行开源和3个专有LLM的Elo评分。结果表明，o1-mini和QwQ-3 预示版本表现突出，分别获得了Elo评分达到分高的效果达极高评价和不错的表现一般相较于而言评级居于中游超过其他模型。而其他模型在解决最简单的问题时都面临困难，在所有参赛人类中位列最后百分之二十以内。还进行了详细的实验分析以深入了解不同算法的性能以及使用C++和Python之间的比较对比情况将提供未来研究的建议方向。

论文及项目相关链接

PDF

Summary

基于现有大型语言模型（LLM）日益增强的代码推理能力，以及如OpenAI o1和o3等推理模型的突破，亟需开发更具挑战性和综合性的基准测试，以有效测试它们的高级竞赛级编程能力。CodeElo基准测试填补了现有基准测试（如LiveCodeBench和USACO）的空白，首次有效地解决了这些挑战。CodeElo基准测试主要基于官方的CodeForces平台，并尽可能与之对齐。它通过收集近六个月的CodeForces竞赛问题，引入独特的评判方法和可靠的Elo评分计算系统，为30个现有流行的开源和3个专有LLM提供了第一次的Elo评分。结果初步显示，o1-mini和QwQ-32B-Preview表现突出，而其他模型则在最简单的问题上挣扎。

Key Takeaways

随着大型语言模型（LLM）的代码推理能力日益增强，需要更挑战性和综合性的基准测试来评估它们的竞赛级编程能力。
现有基准测试如LiveCodeBench和USACO存在不足，缺乏私有测试用例、特殊裁判支持和执行环境不匹配等问题。
CodeElo基准测试基于CodeForces平台，旨在解决现有基准测试的不足，提供更全面、可靠的评估方法。
CodeElo引入独特的评判方法和可靠的Elo评分计算系统，与平台对齐并具有与人类参与者可比性但具有较低方差。
CodeElo基准测试收集了近六个月的CodeForces竞赛问题，为多种LLM提供了第一次的Elo评分。
在CodeElo测试中，o1-mini和QwQ-32B-Preview表现优异，而其他模型在简单问题上表现挣扎。

Cool Papers

点此查看论文截图

Digital Guardians: Can GPT-4, Perspective API, and Moderation API reliably detect hate speech in reader comments of German online newspapers?

Authors:Manuel Weber, Moritz Huber, Maximilian Auch, Alexander Döschl, Max-Emanuel Keller, Peter Mandl

In recent years, toxic content and hate speech have become widespread phenomena on the internet. Moderators of online newspapers and forums are now required, partly due to legal regulations, to carefully review and, if necessary, delete reader comments. This is a labor-intensive process. Some providers of large language models already offer solutions for automated hate speech detection or the identification of toxic content. These include GPT-4o from OpenAI, Jigsaw’s (Google) Perspective API, and OpenAI’s Moderation API. Based on the selected German test dataset HOCON34k, which was specifically created for developing tools to detect hate speech in reader comments of online newspapers, these solutions are compared with each other and against the HOCON34k baseline. The test dataset contains 1,592 annotated text samples. For GPT-4o, three different promptings are used, employing a Zero-Shot, One-Shot, and Few-Shot approach. The results of the experiments demonstrate that GPT-4o outperforms both the Perspective API and the Moderation API, and exceeds the HOCON34k baseline by approximately 5 percentage points, as measured by a combined metric of MCC and F2-score.

近年来，有毒内容和仇恨言论已成为互联网上的普遍现象。网络报纸和论坛的管理人员现在部分由于法律条例的要求，需要仔细审查并必要时删除读者评论。这是一个劳动密集型的流程。一些大型语言模型提供商已经提供自动化仇恨言论检测或有毒内容识别的解决方案。这包括OpenAI的GPT-4o、Google Jigsaw的Perspective API和OpenAI的Moderation API。这些解决方案是基于专门为开发在线报纸读者评论中检测仇恨言论的工具而创建的德国测试数据集HOCON34k进行比较的。测试数据集包含1592个注释文本样本。对于GPT-4o，使用了三种不同的提示方法，即零样本、一个样例和几个样例的方法。实验结果表明，GPT-4o在 MCC 和 F2 分数综合指标上的表现优于 Perspective API 和 Moderation API，并且大约高出 HOCON34k 基线 5 个百分点。

论文及项目相关链接

PDF

Summary

互联网上的有毒内容和仇恨言论日益普遍，促使在线报纸和论坛的版主需严格审查读者评论，甚至必要时进行删除，这一过程耗费大量人力。大型语言模型提供商如OpenAI的GPT-4o、Google的Jigsaw Perspective API以及OpenAI的Moderation API等已提供自动检测仇恨言论或有毒内容解决方案。基于专门为检测在线报纸读者评论中的仇恨言论而开发的工具所选择的德国测试数据集HOCON34k，GPT-4o采用零样本、一个样例和少量样本三种提示方法，实验结果显示GPT-4o在综合指标MCC和F2分数上的表现优于Perspective API和Moderation API，并高出HOCON34k基线约5个百分点。

Key Takeaways

互联网上存在广泛的有毒内容和仇恨言论现象，促使在线报纸和论坛版主需要严格审查读者评论。
大型语言模型提供商如OpenAI和Google已经提供了自动检测仇恨言论或有毒内容的解决方案。
基于德国HOCON34k测试数据集的实验结果显示GPT-4o在仇恨言论检测方面表现优秀。
GPT-4o采用三种不同的提示方法，包括零样本、一个样例和少量样本，显示出其灵活性和有效性。
GPT-4o的综合性能优于其他解决方案，如Perspective API和Moderation API。
GPT-4o的检测准确率高于HOCON34k基线约5个百分点。

Cool Papers

点此查看论文截图

SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization

Authors:Yongle Huang, Haodong Chen, Zhenbang Xu, Zihan Jia, Haozhou Sun, Dian Shao

Human action understanding is crucial for the advancement of multimodal systems. While recent developments, driven by powerful large language models (LLMs), aim to be general enough to cover a wide range of categories, they often overlook the need for more specific capabilities. In this work, we address the more challenging task of Fine-grained Action Recognition (FAR), which focuses on detailed semantic labels within shorter temporal duration (e.g., “salto backward tucked with 1 turn”). Given the high costs of annotating fine-grained labels and the substantial data needed for fine-tuning LLMs, we propose to adopt semi-supervised learning (SSL). Our framework, SeFAR, incorporates several innovative designs to tackle these challenges. Specifically, to capture sufficient visual details, we construct Dual-level temporal elements as more effective representations, based on which we design a new strong augmentation strategy for the Teacher-Student learning paradigm through involving moderate temporal perturbation. Furthermore, to handle the high uncertainty within the teacher model’s predictions for FAR, we propose the Adaptive Regulation to stabilize the learning process. Experiments show that SeFAR achieves state-of-the-art performance on two FAR datasets, FineGym and FineDiving, across various data scopes. It also outperforms other semi-supervised methods on two classical coarse-grained datasets, UCF101 and HMDB51. Further analysis and ablation studies validate the effectiveness of our designs. Additionally, we show that the features extracted by our SeFAR could largely promote the ability of multimodal foundation models to understand fine-grained and domain-specific semantics.

人类行为理解对于多模态系统的进步至关重要。虽然最近的进展，受到强大的大型语言模型（LLM）的推动，旨在涵盖广泛的类别，但它们往往忽视了更特定能力的需求。在这项工作中，我们解决了更具挑战性的精细动作识别（FAR）任务，该任务关注更短时间内的详细语义标签（例如，“向后翻滚一周并俯身”）。考虑到精细标签标注的高成本和微调LLM所需的大量数据，我们提议采用半监督学习（SSL）。我们的框架SeFAR结合了多种创新设计来解决这些挑战。具体来说，为了捕获足够的视觉细节，我们构建了双层次时间元素作为更有效的表示，并在此基础上设计了新的强力增强策略，通过适度的时间扰动来参与教师-学生学习范式。此外，为了处理教师模型对FAR预测的高不确定性，我们提出了自适应调节来稳定学习过程。实验表明，SeFAR在两个精细动作识别数据集FineGym和FineDiving上，跨越各种数据范围都实现了最先进的性能。在另外两个经典的粗粒度数据集UCF101和HMDB51上，它也优于其他半监督方法。进一步的分析和消融研究验证了我们的设计的有效性。此外，我们还展示了我们的SeFAR所提取的特征可以极大地促进多模态基础模型对精细粒度和特定领域的语义的理解能力。

论文及项目相关链接

PDF AAAI 2025; Code: https://github.com/KyleHuang9/SeFAR

Summary

本文关注在多模态系统发展中，精细动作识别（FAR）的重要性。针对精细动作识别所需的具体能力，提出采用半监督学习（SSL）的方法来解决标注成本高和需要大量数据微调的问题。文章提出了一种名为SeFAR的框架，通过构建双层次时间元素作为更有效的表示，设计了一种新的增强策略，并采用自适应调节来处理教师模型预测的高不确定性。实验证明，SeFAR在精细动作识别数据集上取得了最新性能，并在经典粗粒度数据集上也超越了其他半监督方法。此外，SeFAR提取的特征大大提高了多模态基础模型对精细粒度和特定领域的语义理解能力。

Key Takeaways

人机交互系统需要精细动作识别（FAR）来提升性能，尤其是在多模态系统中。
标注精细动作标签的成本高昂且需要大量数据对模型进行微调。
提出了一种基于半监督学习（SSL）的SeFAR框架来解决上述问题。
SeFAR通过构建双层次时间元素更有效地捕捉视觉细节，并设计了创新的增强策略。
SeFAR采用自适应调节来处理教师模型预测的高不确定性，确保学习过程的稳定性。
SeFAR在多个数据集上的实验表现达到最新性能水平，超越了其他半监督方法。

Cool Papers

点此查看论文截图

Authors:Lixiong Qin, Shilong Ou, Miaoxuan Zhang, Jiangning Wei, Yuhang Zhang, Xiaoshuai Song, Yuchen Liu, Mei Wang, Weiran Xu

Faces and humans are crucial elements in social interaction and are widely included in everyday photos and videos. Therefore, a deep understanding of faces and humans will enable multi-modal assistants to achieve improved response quality and broadened application scope. Currently, the multi-modal assistant community lacks a comprehensive and scientific evaluation of face and human understanding abilities. In this paper, we first propose a hierarchical ability taxonomy that includes three levels of abilities. Then, based on this taxonomy, we collect images and annotations from publicly available datasets in the face and human community and build a semi-automatic data pipeline to produce problems for the new benchmark. Finally, the obtained Face-Human-Bench comprises a development set with 900 problems and a test set with 1800 problems, supporting both English and Chinese. We conduct evaluations over 25 mainstream multi-modal large language models (MLLMs) with our Face-Human-Bench, focusing on the correlation between abilities, the impact of the relative position of targets on performance, and the impact of Chain of Thought (CoT) prompting on performance. Moreover, inspired by multi-modal agents, we also explore which abilities of MLLMs need to be supplemented by specialist models.

面孔和人类是社会交互中的重要元素，广泛存在于日常照片和视频中。因此，对面孔和人类的深刻理解能使多模态助理提高响应质量和扩大应用范围。目前，多模态助理社区缺乏对面孔和人类理解能力的全面科学评估。在本文中，我们首先提出一个分层的能力分类体系，包括三个层次的能力。然后，基于这一分类体系，我们从面孔和人类社区可公开访问的数据集中收集图像和注释，并建立半自动数据管道来生成新问题，以构建新的基准测试。最后，所获得的Face-Human-Bench包括一个包含900个问题的开发集和一个包含1800个问题的测试集，支持英语和中文。我们使用Face-Human-Bench对25个主流的多模态大型语言模型（MLLMs）进行了评估，重点关注能力之间的相关性、目标相对位置对性能的影响、以及思维链（CoT）提示对性能的影响。此外，受到多模态代理的启发，我们还探索了MLLMs需要哪些能力需要由专业模型来补充。

论文及项目相关链接

PDF 50 pages, 14 figures, 41 tables. Submitted to ICLR 2025

Summary

本文提出了一个面向多模态大型语言模型（MLLMs）的Face-Human-Bench基准测试集，旨在评估模型对面部与人类理解能力的全面科学评估。该测试集包含三个级别的能力分类，采用来自面部与人类领域的公开可用数据集的图片和注释，构建了一个半自动的数据管道来生成测试问题。此外，该研究还探讨了不同模型能力之间的相关性，目标相对位置对性能的影响以及链式思维（CoT）提示对性能的影响，并探讨了MLLMs需要哪些能力需要由专业模型进行补充。

Key Takeaways

面部和人体理解在多模态交互中非常重要，能够提升智能助理的反应质量和扩大应用范围。
当前多模态助理社区缺乏对面部和人类理解能力的全面科学评估。
本文提出了一个层次化的能力分类体系，包含三个级别的能力。
基于该分类体系，收集和构建了Face-Human-Bench基准测试集，包含开发集和测试集，支持中英文。
研究评估了25个主流多模态大型语言模型（MLLMs）的能力，探讨了能力间的相关性以及目标相对位置和链式思维（CoT）提示对性能的影响。
研究发现，MLLMs在某些能力上需要由专业模型进行补充。

Cool Papers

点此查看论文截图

Authors:Alexander Brinkmann, Christian Bizer

Structured product data, in the form of attribute-value pairs, is essential for e-commerce platforms to support features such as faceted product search and attribute-based product comparison. However, vendors often provide unstructured product descriptions, making attribute value extraction necessary to ensure data consistency and usability. Large language models (LLMs) have demonstrated their potential for product attribute value extraction in few-shot scenarios. Recent research has shown that self-refinement techniques can improve the performance of LLMs on tasks such as code generation and text-to-SQL translation. For other tasks, the application of these techniques has resulted in increased costs due to processing additional tokens, without achieving any improvement in performance. This paper investigates applying two self-refinement techniques, error-based prompt rewriting and self-correction, to the product attribute value extraction task. The self-refinement techniques are evaluated across zero-shot, few-shot in-context learning, and fine-tuning scenarios using GPT-4o. The experiments show that both self-refinement techniques have only a marginal impact on the model’s performance across the different scenarios, while significantly increasing processing costs. For scenarios with training data, fine-tuning yields the highest performance, while the ramp-up costs of fine-tuning are balanced out as the amount of product descriptions increases.

结构化产品数据以属性-值对的形式存在，对于电子商务平台支持面向方面的产品搜索和基于属性的产品比较等功能至关重要。然而，供应商通常提供非结构化的产品描述，因此需要进行属性值提取以确保数据的一致性和可用性。大型语言模型（LLM）在少量场景演示了其在产品属性值提取方面的潜力。最新研究表明，自我完善技术可以提高代码生成和文本到SQL翻译等任务的性能。对于其他任务，这些技术的应用由于处理额外的令牌而增加了成本，而没有实现性能上的任何改进。本文研究了将两种自我完善技术——基于错误的提示重写和自我校正应用于产品属性值提取任务。这些自我完善技术在零样本、少量上下文学习和微调场景中，使用GPT-4o进行了评估。实验表明，这两种自我完善技术对模型在不同场景下的性能影响微乎其微，但显著增加了处理成本。对于有训练数据的场景，微调会产生最高的性能，而随着产品描述数量的增加，微调的一次性成本得到了平衡。

论文及项目相关链接

PDF

Summary

结构化产品数据以属性-值对的形式对于电子商务平台至关重要，支持面向属性搜索和属性比较等功能。然而，由于供应商提供的产品描述多为非结构化形式，因此需要进行属性值提取以确保数据的一致性和可用性。大型语言模型（LLM）已在少数场景中展现出其在产品属性值提取方面的潜力。本研究探讨了两种自我完善技术——基于错误的提示重写和自我修正——在产品属性值提取任务中的应用。在零样本、少数场景上下文学习和微调场景下使用GPT-4o对这两种自我完善技术进行评估，实验表明这些技术对模型性能的影响有限且提高了处理成本。存在训练数据时，微调可获得最佳性能。随着产品描述数量的增加，微调的初始成本得到平衡。

Key Takeaways

结构化产品数据对于电子商务平台至关重要，支持如面向属性搜索和属性比较等功能。
供应商提供的产品描述多为非结构化形式，需要进行属性值提取以确保数据的一致性和可用性。
大型语言模型（LLM）在产品属性值提取方面表现出潜力。
研究中探讨了两种自我完善技术：基于错误的提示重写和自我修正应用于产品属性值提取任务。
在零样本、少数场景和微调场景下评估这两种技术时，发现它们对模型性能的影响有限且提高了处理成本。

Cool Papers

点此查看论文截图

HetGCoT-Rec: Heterogeneous Graph-Enhanced Chain-of-Thought LLM Reasoning for Journal Recommendation

Authors:Runsong Jia, Mengjia Wu, Ying Ding, Jie Lu, Yi Zhang

Academic journal recommendation requires effectively combining structural understanding of scholarly networks with interpretable recommendations. While graph neural networks (GNNs) and large language models (LLMs) excel in their respective domains, current approaches often fail to achieve true integration at the reasoning level. We propose HetGCoT-Rec, a framework that deeply integrates heterogeneous graph transformer with LLMs through chain-of-thought reasoning. Our framework features two key technical innovations: (1) a structure-aware mechanism that transforms heterogeneous graph neural network learned subgraph information into natural language contexts, utilizing predefined metapaths to capture academic relationships, and (2) a multi-step reasoning strategy that systematically embeds graph-derived contexts into the LLM’s stage-wise reasoning process. Experiments on a dataset collected from OpenAlex demonstrate that our approach significantly outperforms baseline methods, achieving 96.48% Hit rate and 92.21% H@1 accuracy. Furthermore, we validate the framework’s adaptability across different LLM architectures, showing consistent improvements in both recommendation accuracy and explanation quality. Our work demonstrates an effective approach for combining graph-structured reasoning with language models for interpretable academic venue recommendations.

学术期刊推荐需要结合对学术网络的深刻理解和可解释的推荐。虽然图神经网络（GNNs）和大语言模型（LLMs）在各自的领域表现出色，但当前的方法往往无法真正实现在推理层面的整合。我们提出HetGCoT-Rec框架，它通过思维链推理，将异构图转换器与LLMs深度集成。我们的框架具有两项关键技术创新：（1）一种结构感知机制，它将异构图神经网络学习的子图信息转换为自然语言上下文，并利用预定义的元路径来捕捉学术关系；（2）一种多步骤推理策略，它系统地嵌入图派生的上下文到LLM的阶段推理过程中。在OpenAlex收集的数据集上进行的实验表明，我们的方法显著优于基线方法，达到了96.48%的命中率和92.21%的H@1准确率。此外，我们验证了框架在不同LLM架构中的适应性，在推荐准确性和解释质量方面均显示出持续的改进。我们的工作展示了将图结构推理与语言模型相结合进行可解释的学术期刊推荐的有效方法。

论文及项目相关链接

PDF

Summary

该文提出一种结合异质图神经网络与大型语言模型（LLM）的学术推荐框架HetGCoT-Rec。该框架通过链式思维推理实现深度整合，具有结构感知机制和多步推理策略两项关键技术创新。实验结果显示，该方法在OpenAlex数据集上显著优于基准方法，实现高命中率和高准确度。同时，该框架适应不同的LLM架构，在提高推荐准确性和解释质量方面表现一致。研究展示了结合图结构推理和语言模型的有效方法，为可解释的学术场所推荐提供了新途径。

Key Takeaways

HetGCoT-Rec框架结合了异质图神经网络和大型语言模型（LLM），实现学术推荐的系统性整合。
结构感知机制能够将异质图神经网络学习的子图信息转化为自然语言上下文。
通过预定义的元路径捕获学术关系。
多步推理策略将图衍生上下文系统地嵌入LLM的阶段性推理过程中。
实验结果表明，HetGCoT-Rec在OpenAlex数据集上的推荐效果优于其他方法，具有高命中率和准确度。
该框架适应不同的LLM架构，并在推荐准确性和解释质量方面表现一致。

Cool Papers

点此查看论文截图

Towards Interactive Deepfake Analysis

Authors:Lixiong Qin, Ning Jiang, Yang Zhang, Yuhan Qiu, Dingheng Zeng, Jiani Hu, Weihong Deng

Existing deepfake analysis methods are primarily based on discriminative models, which significantly limit their application scenarios. This paper aims to explore interactive deepfake analysis by performing instruction tuning on multi-modal large language models (MLLMs). This will face challenges such as the lack of datasets and benchmarks, and low training efficiency. To address these issues, we introduce (1) a GPT-assisted data construction process resulting in an instruction-following dataset called DFA-Instruct, (2) a benchmark named DFA-Bench, designed to comprehensively evaluate the capabilities of MLLMs in deepfake detection, deepfake classification, and artifact description, and (3) construct an interactive deepfake analysis system called DFA-GPT, as a strong baseline for the community, with the Low-Rank Adaptation (LoRA) module. The dataset and code will be made available at https://github.com/lxq1000/DFA-Instruct to facilitate further research.

现有的深度伪造分析方法主要基于判别模型，这极大地限制了其应用场景。本文旨在通过多模态大型语言模型（MLLMs）的指令微调，探索交互式深度伪造分析。这将面临数据集和基准测试缺乏以及训练效率低下等挑战。为了解决这些问题，我们引入了（1）一种GPT辅助的数据构建过程，产生了名为DFA-Instruct的指令遵循数据集，（2）一个名为DFA-Bench的基准测试，旨在全面评估MLLMs在深度伪造检测、深度伪造分类和伪造物描述方面的能力，（3）构建一个名为DFA-GPT的交互式深度伪造分析系统，作为社区的强大基线，并配备了低秩适配（LoRA）模块。数据集和代码将在https://github.com/lxq1000/DFA-Instruct上提供，以便进行进一步的研究。

论文及项目相关链接

PDF

Summary
基于当前深度伪造分析方法的局限性，本文旨在通过多模态大型语言模型（MLLMs）的指令微调，探索交互式深度伪造分析。为解决缺乏数据集和基准测试、训练效率低下等问题，本文引入了GPT辅助数据构建过程，创建了指令遵循数据集DFA-Instruct，设计了全面评估MLLMs在深度伪造检测、分类和伪造物描述能力的基准测试DFA-Bench，并构建了交互式深度伪造分析系统DFA-GPT，作为社区的强大基线，配备Low-Rank Adaptation（LoRA）模块。数据集和代码将在https://github.com/lxq1000/DFA-Instruct上提供，以方便进一步研究。

Key Takeaways

现有深度伪造分析方法主要基于判别模型，限制了其应用场景。
本文旨在通过多模态大型语言模型的指令微调，探索交互式深度伪造分析。
面临缺乏数据集和基准测试、训练效率低的挑战。
引入了GPT辅助数据构建过程，创建了指令遵循数据集DFA-Instruct。
设计了全面评估MLLMs在深度伪造检测、分类和伪造物描述能力的基准测试DFA-Bench。
构建了交互式深度伪造分析系统DFA-GPT，作为社区研究的基线。

Cool Papers

点此查看论文截图

2 OLMo 2 Furious

Authors:Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, Hannaneh Hajishirzi

We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes dense autoregressive models with improved architecture and training recipe, pretraining data mixtures, and instruction tuning recipes. Our modified model architecture and training recipe achieve both better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities across many downstream task benchmarks when introduced via late-stage curriculum training (i.e. specialized data during the annealing phase of pretraining). Finally, we incorporate best practices from T"ulu 3 to develop OLMo 2-Instruct, focusing on permissive data and extending our final-stage reinforcement learning with verifiable rewards (RLVR). Our OLMo 2 base models sit at the Pareto frontier of performance to compute, often matching or outperforming open-weight only models like Llama 3.1 and Qwen 2.5 while using fewer FLOPs and with fully transparent training data, code, and recipe. Our fully open OLMo 2-Instruct models are competitive with or surpassing open-weight only models of comparable size, including Qwen 2.5, Llama 3.1 and Gemma 2. We release all OLMo 2 artifacts openly – models at 7B and 13B scales, both pretrained and post-trained, including their full training data, training code and recipes, training logs and thousands of intermediate checkpoints. The final instruction model is available on the Ai2 Playground as a free research demo.

我们推出了OLMo 2，这是我们完全开放式语言模型的下一代产品。OLMo 2包括具有改进架构和训练方案的密集自回归模型、预训练数据混合以及指令调整方案。我们的改进模型架构和训练方案实现了更好的训练稳定性和每标记符的效率提升。我们更新的预训练数据混合引入了一种新的专门数据混合称为Dolmino Mix 1124，通过后期课程训练（即在预训练的退火阶段引入专门数据）引入时，它能在许多下游任务基准测试中显著提高模型能力。最后，我们结合了Tulu 3的最佳实践来开发OLMo 2-Instruct，专注于许可数据，并扩展我们最终阶段的强化学习可验证奖励（RLVR）。我们的OLMo 2基础模型位于性能计算前沿，通常与Llama 3.1和Qwen 2.5等公开权重模型相匹配或表现更佳，同时使用更少的浮点运算，并具有完全透明的训练数据、代码和方案。我们的完全开放式OLMo 2-Instruct模型与相同规模的公开权重模型竞争或表现更佳，包括Qwen 2.5、Llama 3.1和Gemma 2。我们公开所有OLMo 2制品——规模为7B和13B的模型，包括预训练和后期训练、完整的训练数据、训练代码和方案、训练日志以及数千个中间检查点。最终指令模型可在Ai2游乐场作为免费研究演示使用。

论文及项目相关链接

PDF Model demo available at playground.allenai.org

摘要

OLMo 2作为全新一代开源语言模型发布，具有改进后的架构和训练方案、预训练数据混合以及指令调整配方。新模型架构和训练方案提高了训练稳定性和每标记的效率和性能。更新的预训练数据混合引入了名为Dolmino Mix 1124的专门数据混合，通过晚期课程训练（即在预训练的退火阶段引入特定数据）在多个下游任务基准测试中显著提高了模型能力。最后，结合图卢3的最佳实践，开发OLMo 2-Instruct，专注于许可数据，并利用可验证奖励扩展最终阶段的强化学习。OLMo 2基础模型在计算性能上处于前沿，往往能匹配或超越如Llama 3.1和Qwen 2.5等仅开源权重模型，同时减少浮点运算使用量并公开透明化训练数据、代码和配方。完全开源的OLMo 2-Instruct模型在规模相当的情况下与Qwen 2.5、Llama 3.1和Gemma 2等模型竞争，并表现优异。我们公开所有OLMo 2工件——包括规模为7B和13B的预训练和后续训练模型、完整的训练数据、训练代码和配方、训练日志以及数千个中间检查点。最终指令模型可在Ai2 Playground上作为免费研究演示使用。

要点

OLMo 2是新一代开源语言模型，具有改进的架构和训练方案。
引入新的预训练数据混合方法Dolmino Mix 1124，通过晚期课程训练提高模型能力。
结合图卢3的最佳实践，开发OLMo 2-Instruct模型，注重许可数据的使用和强化学习的可验证奖励。
OLMo 2基础模型性能优越，与其他开源模型相比具有更高的效率和性能。
完全开源的OLMo 2-Instruct模型具有竞争力，可在规模相当的情况下与其他优秀模型相媲美。
所有OLMo 2相关工件均公开发布，包括模型、数据、代码和日志等。

Cool Papers

点此查看论文截图

Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method

Authors:Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, Limin Wang

Multimodal Large Language Models (MLLMs) have shown significant progress in offline video understanding. However, applying these models to real-world scenarios, such as autonomous driving and human-computer interaction, presents unique challenges due to the need for real-time processing of continuous online video streams. To this end, this paper presents systematic efforts from three perspectives: evaluation benchmark, model architecture, and training strategy. First, we introduce OVBench, a comprehensive question-answering benchmark specifically designed to evaluate models’ ability to perceive, memorize, and reason within online video contexts. It features six core task types across three temporal contexts-past, present, and future-forming 16 subtasks from diverse datasets. Second, we propose a new Pyramid Memory Bank (PMB) that effectively retains key spatiotemporal information in video streams. Third, we proposed an offline-to-online learning paradigm, designing an interleaved dialogue format for online video data and constructing an instruction-tuning dataset tailored for online video training. This framework led to the development of VideoChat-Online, a robust and efficient model for online video understanding. Despite the lower computational cost and higher efficiency, VideoChat-Online outperforms existing state-of-the-art offline and online models across popular offline video benchmarks and OVBench, demonstrating the effectiveness of our model architecture and training strategy.

多模态大型语言模型（MLLMs）在离线视频理解方面取得了显著进展。然而，将这些模型应用于自动驾驶和人机交互等现实场景，由于需要对连续的在线视频流进行实时处理，面临着独特的挑战。为此，本文从评估基准、模型架构和训练策略三个方面进行了系统的努力。首先，我们介绍了OVBench，这是一个专门设计用于评估模型在在线视频上下文中的感知、记忆和推理能力的综合问答基准。它涵盖了过去、现在和未来的三个时间上下文中的六种核心任务类型，形成了16个子任务，来源于各种数据集。其次，我们提出了一种新的金字塔记忆库（PMB），它能有效地保留视频流中的关键时空信息。第三，我们提出了从离线到在线的学习范式，设计了适合在线视频数据的交互式对话格式，并构建了一个针对在线视频训练的指令调整数据集。这一框架催生了VideoChat-Online的诞生，这是一个用于在线视频理解的稳健而高效的模型。尽管计算成本较低，效率较高，VideoChat-Online在流行的离线视频基准测试和OVBench上的表现都优于现有的先进离线和在线模型，这证明了我们的模型架构和训练策略的有效性。

论文及项目相关链接

PDF

Summary

基于多模态大型语言模型（MLLMs）在离线视频理解方面的显著进步，本文围绕在线视频流实时处理的需求，从评估基准、模型架构和训练策略三个方面进行了系统努力。介绍了专门为在线视频语境设计的问答评估基准OVBench，提出了有效保留视频流中关键时空信息的Pyramid Memory Bank（PMB），以及适用于在线视频数据的离线到在线学习范式。这些成果最终促成了VideoChat-Online模型的诞生，该模型在离线视频基准测试和OVBench上均表现出卓越性能，且计算成本低、效率高。

Key Takeaways

MLLMs在离线视频理解方面取得显著进展，但在处理连续在线视频流方面面临挑战。
引入新的问答评估基准OVBench，涵盖在线视频语境中的感知、记忆和推理能力评估。
提出Pyramid Memory Bank（PMB），有效保留视频流中的关键时空信息。
采用离线到在线的学习范式，适应在线视频数据，构建针对性的训练数据集。
VideoChat-Online模型具备稳健性和高效性，适用于在线视频理解。
VideoChat-Online在离线视频基准测试和OVBench上表现出优异性能，优于现有主流模型。

Cool Papers

点此查看论文截图

VisTabNet: Adapting Vision Transformers for Tabular Data

Authors:Witold Wydmański, Ulvi Movsum-zada, Jacek Tabor, Marek Śmieja

Although deep learning models have had great success in natural language processing and computer vision, we do not observe comparable improvements in the case of tabular data, which is still the most common data type used in biological, industrial and financial applications. In particular, it is challenging to transfer large-scale pre-trained models to downstream tasks defined on small tabular datasets. To address this, we propose VisTabNet – a cross-modal transfer learning method, which allows for adapting Vision Transformer (ViT) with pre-trained weights to process tabular data. By projecting tabular inputs to patch embeddings acceptable by ViT, we can directly apply a pre-trained Transformer Encoder to tabular inputs. This approach eliminates the conceptual cost of designing a suitable architecture for processing tabular data, while reducing the computational cost of training the model from scratch. Experimental results on multiple small tabular datasets (less than 1k samples) demonstrate VisTabNet’s superiority, outperforming both traditional ensemble methods and recent deep learning models. The proposed method goes beyond conventional transfer learning practice and shows that pre-trained image models can be transferred to solve tabular problems, extending the boundaries of transfer learning.

尽管深度学习模型在自然语言处理和计算机视觉领域取得了巨大成功，我们并没有观察到在表格数据的情况下有相应的改进，而表格数据仍然是生物、工业和财务应用中最常用的数据类型。特别是，将大规模预训练模型转移到在小型表格数据集上定义的下游任务是一项具有挑战性的任务。为了解决这一问题，我们提出了VisTabNet——一种跨模态迁移学习方法，它允许使用预训练权重的Vision Transformer（ViT）来处理表格数据。通过将表格输入投影到ViT可接受的补丁嵌入，我们可以直接将预训练的Transformer编码器应用于表格输入。这种方法消除了为处理表格数据而设计合适架构的概念成本，同时降低了从头开始训练模型的计算成本。在多个小型表格数据集（少于1k个样本）上的实验结果证明了VisTabNet的优越性，它超越了传统的集成方法和最新的深度学习模型。所提出的方法超越了传统的迁移学习实践，表明预训练的图像模型可以转移到解决表格问题，扩展了迁移学习的边界。

论文及项目相关链接

PDF

Summary
深學模型在自然語言處理和計算機視覺方面取得巨大成功，但在處理表格數據時並未出現類似進展，而表格數據仍在生物、工业和金融應用中是最常使用的數據類型。本文提出VisTabNet——一種跨模態轉移學習方法，可將預訓練的Vision Transformer（ViT）模型適應於處理表格數據。通過將表格輸入投影至適應ViT的patch embeddings，可直接將預訓練的Transformer Encoder應用於表格數據。此做法免除了為處理表格數據而設計適應性結構的概念成本，並降低了從零開始訓練模型的計算成本。在數個小型表格數據集（樣本數少於1k）上的實驗結果顯示VisTabNet的優越性，超越了傳統集成方法和最新的深度學習模型。此方法突破了傳統轉移學習的範式，證明預訓練的圖像模型可轉移至解決表格問題，擴展了轉移學習的界限。

Key Takeaways

深度學習模型在自然語言處理和計算機視覺方面表現出色，但在處理表格數據時改善有限。
表格數據在生物、工业和金融應用中廣為使用。
VisTabNet提出一種跨模態轉移學習方法，將Vision Transformer（ViT）應用於處理表格數據。
VisTabNet通過將表格輸入轉換為patch embeddings，適應預訓練的Transformer Encoder。
此方法簡化了為處理表格數據而設計特定結構的需求，並降低了模型訓練成本。
在小型表格數據集上的實驗顯示VisTabNet優於傳統集成方法和深度學習模型。

Cool Papers

点此查看论文截图

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-01-06/LLM/

本博客所有文章除特別声明外，均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !

LLM

Agent

Agent 方向最新论文已更新，请持续关注 Update in 2025-01-06 A3 Android Agent Arena for Mobile GUI Agents

2025-01-06 Agent

Agent

LLM

LLM 方向最新论文已更新，请持续关注 Update in 2025-01-05 RLAIF-V Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness

2025-01-05 LLM

LLM

LLM

2025-01-06 更新

Unifying Specialized Visual Encoders for Video Language Models

OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios

Aligning Large Language Models for Faithful Integrity Against Opposing Argument

CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models

The Prompt Alchemist: Automated LLM-Tailored Prompt Optimization for Test Case Generation

Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking

Citations and Trust in LLM Generated Responses

CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings

Digital Guardians: Can GPT-4, Perspective API, and Moderation API reliably detect hate speech in reader comments of German online newspapers?

SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization

Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants

Automated Self-Refinement and Self-Correction for LLM-based Product Attribute Value Extraction

HetGCoT-Rec: Heterogeneous Graph-Enhanced Chain-of-Thought LLM Reasoning for Journal Recommendation

Towards Interactive Deepfake Analysis

2 OLMo 2 Furious

Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method

VisTabNet: Adapting Vision Transformers for Tabular Data