发布日期: 2025-01-16

更新日期: 2025-01-16

文章字数: 20.6k

阅读时长: 84 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-01-16 更新

Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints

Authors:Jonathan Nöther, Adish Singla, Goran Radanović

Recent work has proposed automated red-teaming methods for testing the vulnerabilities of a given target large language model (LLM). These methods use red-teaming LLMs to uncover inputs that induce harmful behavior in a target LLM. In this paper, we study red-teaming strategies that enable a targeted security assessment. We propose an optimization framework for red-teaming with proximity constraints, where the discovered prompts must be similar to reference prompts from a given dataset. This dataset serves as a template for the discovered prompts, anchoring the search for test-cases to specific topics, writing styles, or types of harmful behavior. We show that established auto-regressive model architectures do not perform well in this setting. We therefore introduce a black-box red-teaming method inspired by text-diffusion models: Diffusion for Auditing and Red-Teaming (DART). DART modifies the reference prompt by perturbing it in the embedding space, directly controlling the amount of change introduced. We systematically evaluate our method by comparing its effectiveness with established methods based on model fine-tuning and zero- and few-shot prompting. Our results show that DART is significantly more effective at discovering harmful inputs in close proximity to the reference prompt.

最近的工作提出了针对给定的大型语言模型（LLM）漏洞测试的自动化红队方法。这些方法使用红队LLM来发现会在目标LLM中引发有害行为的输入。在本文中，我们研究了能够进行有针对性的安全评估的红队策略。我们提出了带有接近度约束的红队优化框架，其中发现的提示必须与给定数据集中的参考提示相似。该数据集作为发现的提示的模板，将测试用例搜索锚定到特定主题、写作风格或有害行为类型上。我们表明，现有的自动回归模型架构在此设置中表现不佳。因此，我们引入了一种受文本扩散模型启发的黑盒红队方法：用于审计和红队的扩散（DART）。DART通过嵌入空间中对其进行扰动来修改参考提示，直接控制引入的更改量。我们通过将其与基于模型微调以及零样本和少样本提示的现有方法进行比较，系统地评估了我们的方法的有效性。结果表明，DART在发现与参考提示接近的有害输入方面显著更有效。

论文及项目相关链接

PDF This is an extended version of a paper published at AAAI 25

Summary

基于大型语言模型（LLM）的自动红队测试方法，旨在发现针对目标LLM产生有害行为的输入。本文研究了使目标安全评估成为可能的红队策略，并提出了一个带有近似约束的红队优化框架。此外，本文发现现有的自动回归模型架构在此设置中表现不佳，因此引入了一种基于文本扩散模型的黑色盒子红队方法：审计和红队扩散（DART）。通过系统地评估，结果表明DART在发现与参考提示相近的有害输入方面显著更有效。

Key Takeaways

自动化红队方法用于测试大型语言模型（LLM）的漏洞。
红队策略旨在发现导致目标LLM产生有害行为的输入。
提出了一个带有近似约束的红队优化框架，参考提示作为模板，使搜索测试用例集中于特定主题、写作风格或有害行为类型。
现有自动回归模型架构在此设置中表现不佳。
引入了一种基于文本扩散模型的黑色盒子红队方法：DART（审计和红队扩散）。
DART通过扰动参考提示的嵌入空间进行修改，直接控制引入的变化量。

Cool Papers

点此查看论文截图

LeapVAD: A Leap in Autonomous Driving via Cognitive Perception and Dual-Process Thinking

Authors:Yukai Ma, Tiantian Wei, Naiting Zhong, Jianbiao Mei, Tao Hu, Licheng Wen, Xuemeng Yang, Botian Shi, Yong Liu

While autonomous driving technology has made remarkable strides, data-driven approaches still struggle with complex scenarios due to their limited reasoning capabilities. Meanwhile, knowledge-driven autonomous driving systems have evolved considerably with the popularization of visual language models. In this paper, we propose LeapVAD, a novel method based on cognitive perception and dual-process thinking. Our approach implements a human-attentional mechanism to identify and focus on critical traffic elements that influence driving decisions. By characterizing these objects through comprehensive attributes - including appearance, motion patterns, and associated risks - LeapVAD achieves more effective environmental representation and streamlines the decision-making process. Furthermore, LeapVAD incorporates an innovative dual-process decision-making module miming the human-driving learning process. The system consists of an Analytic Process (System-II) that accumulates driving experience through logical reasoning and a Heuristic Process (System-I) that refines this knowledge via fine-tuning and few-shot learning. LeapVAD also includes reflective mechanisms and a growing memory bank, enabling it to learn from past mistakes and continuously improve its performance in a closed-loop environment. To enhance efficiency, we develop a scene encoder network that generates compact scene representations for rapid retrieval of relevant driving experiences. Extensive evaluations conducted on two leading autonomous driving simulators, CARLA and DriveArena, demonstrate that LeapVAD achieves superior performance compared to camera-only approaches despite limited training data. Comprehensive ablation studies further emphasize its effectiveness in continuous learning and domain adaptation. Project page: https://pjlab-adg.github.io/LeapVAD/.

虽然自动驾驶技术在近年来取得了显著的进步，但数据驱动的方法在复杂场景下仍然面临着挑战，因为其推理能力有限。与此同时，随着视觉语言模型的普及，知识驱动的自动驾驶系统有了显著的发展。在本文中，我们提出了一种基于认知感知和双过程思考的新型方法LeapVAD。我们的方法实现了人类注意力机制，用于识别和关注影响驾驶决策的关键交通要素。通过对这些物体进行全面的特征描述，包括外观、运动模式和相关风险，LeapVAD实现了更有效的环境表征和简化的决策过程。此外，LeapVAD融入了一个创新的双过程决策模块，模仿人类驾驶学习过程。该系统包括一个分析过程（系统II），通过逻辑推理积累驾驶经验，以及一个启发式过程（系统I），通过微调和小样本学习来完善知识。LeapVAD还包括反射机制和不断增长的记忆库，使其能够从过去的错误中学习并在闭环环境中持续提高其性能。为了提高效率，我们开发了一个场景编码器网络，用于生成紧凑的场景表示，以便快速检索相关的驾驶经验。在CARLA和DriveArena两个领先的自动驾驶模拟器上进行的广泛评估表明，尽管训练数据有限，LeapVAD仍优于仅使用摄像头的方案并实现了卓越的性能。全面的消融研究进一步强调了其在持续学习和域适应方面的有效性。项目页面：https://pjlab-adg.github.io/LeapVAD/。

论文及项目相关链接

PDF

Summary：

本文介绍了LeapVAD这一新型的基于认知感知和双过程思维的自动驾驶技术。它通过引入人类注意力机制，对影响驾驶决策的关键交通元素进行识别和聚焦，实现了更高效的环境表征和决策过程。LeapVAD结合了创新性的双过程决策模块，模仿人类驾驶学习过程，包括分析过程（系统-II）和经验积累以及通过微调和小样本学习的启发式过程（系统-I）。此外，LeapVAD还具有反思机制和成长记忆库，可从过去的错误中学习并持续改进性能。在两大主流自动驾驶模拟器CARLA和DriveArena上的评估表明，LeapVAD在有限训练数据下实现了超越仅使用摄像头的优越性能。

Key Takeaways：

LeapVAD结合认知感知和双过程思维，针对自动驾驶技术提出新方法。
通过引入人类注意力机制，实现对关键交通元素的识别和聚焦。
LeapVAD模仿人类驾驶学习过程，包含分析过程和启发式过程。
系统具有反思机制和成长记忆库，可从错误中学习并持续改进。
在两大模拟器上的评估显示，LeapVAD在有限数据下表现优越。
LeapVAD通过场景编码器网络生成紧凑的场景表示，提高效率。
综合消融研究强调了其在持续学习和域适应方面的有效性。

Cool Papers

点此查看论文截图

I Can Find You in Seconds! Leveraging Large Language Models for Code Authorship Attribution

Authors:Soohyeon Choi, Yong Kiam Tan, Mark Huasong Meng, Mohamed Ragab, Soumik Mondal, David Mohaisen, Khin Mi Mi Aung

Source code authorship attribution is important in software forensics, plagiarism detection, and protecting software patch integrity. Existing techniques often rely on supervised machine learning, which struggles with generalization across different programming languages and coding styles due to the need for large labeled datasets. Inspired by recent advances in natural language authorship analysis using large language models (LLMs), which have shown exceptional performance without task-specific tuning, this paper explores the use of LLMs for source code authorship attribution. We present a comprehensive study demonstrating that state-of-the-art LLMs can successfully attribute source code authorship across different languages. LLMs can determine whether two code snippets are written by the same author with zero-shot prompting, achieving a Matthews Correlation Coefficient (MCC) of 0.78, and can attribute code authorship from a small set of reference code snippets via few-shot learning, achieving MCC of 0.77. Additionally, LLMs show some adversarial robustness against misattribution attacks. Despite these capabilities, we found that naive prompting of LLMs does not scale well with a large number of authors due to input token limitations. To address this, we propose a tournament-style approach for large-scale attribution. Evaluating this approach on datasets of C++ (500 authors, 26,355 samples) and Java (686 authors, 55,267 samples) code from GitHub, we achieve classification accuracy of up to 65% for C++ and 68.7% for Java using only one reference per author. These results open new possibilities for applying LLMs to code authorship attribution in cybersecurity and software engineering.

源代码作者归属在软件取证、抄袭检测以及软件补丁完整性保护方面具有重要意义。现有技术通常依赖于有监督机器学习，由于需要大量标注数据集，它在不同编程语言和编码风格之间的泛化方面表现困难。受自然语言作者分析中使用大型语言模型（LLMs）的最新进展的启发，该论文探索了使用LLMs进行源代码作者归属。我们进行了全面的研究，表明最先进的LLMs可以成功地在不同语言中进行源代码作者归属。LLMs可以通过零样本提示确定两个代码片段是否由同一作者编写，达到马修斯相关系数（MCC）0.78，并且可以从少量参考代码片段中通过小样本学习进行代码作者归属，达到MCC 0.77。此外，LLMs对于一些误归属攻击还表现出一定的对抗性稳健性。尽管具备了这些功能，但我们发现对LLMs的直白提示并不适用于大量作者的情况，因为存在输入标记的限制。为了解决这一问题，我们提出了一种用于大规模归属的锦标赛式方法。在GitHub上的C++（500位作者，26,355个样本）和Java（686位作者，55,267个样本）数据集上评估该方法，仅使用每位作者的一个参考样本，即可达到C++的65%和Java的68.7%的分类准确率。这些结果为在网络安全和软件工程中应用LLMs进行代码作者归属提供了新的可能性。

论文及项目相关链接

PDF 12 pages, 5 figures,

摘要
基于大型语言模型（LLMs）的自然语言作者分析最新进展的启发，本文探索了用于源代码作者归属性的LLMs应用。研究表明，先进LLMs可成功归属不同语言的源代码作者，通过零样本提示确定两个代码片段是否由同一作者撰写，达到马修斯相关系数（MCC）为0.78，并通过少量学习从参考代码片段中归属作者，实现MCC为0.77。尽管具有这些能力，但发现LLMs的直观提示并不适用于大量作者的场景。为解决此问题，本文提出了一种锦标赛风格的大规模归属方法，在GitHub的C++和Java代码数据集上实现高达65%和68.7%的分类准确率。这为在网络安全和软件工程中应用LLMs进行代码作者归属提供了新的可能性。

关键见解

源代码作者归属在软件取证、抄袭检测和软件补丁完整性保护中具有重要性。
现有技术通常依赖于监督机器学习，难以在不同编程语言和编码风格之间进行泛化，需要大规模标注数据集。
LLMs在源代码作者归属方面表现出色，可成功归因于不同语言的作者。
通过零样本提示和少量学习，LLMs可以准确地确定代码片段的作者。
LLMs具有一定的对抗误归属攻击的能力。
直观提示LLMs并不适用于大量作者的场景。

Cool Papers

点此查看论文截图

Inference-Time-Compute: More Faithful? A Research Note

Authors:James Chua, Owain Evans

Models trained specifically to generate long Chains of Thought (CoTs) have recently achieved impressive results. We refer to these models as Inference-Time-Compute (ITC) models. Are the CoTs of ITC models more faithful compared to traditional non-ITC models? We evaluate two ITC models (based on Qwen-2.5 and Gemini-2) on an existing test of faithful CoT To measure faithfulness, we test if models articulate cues in their prompt that influence their answers to MMLU questions. For example, when the cue “A Stanford Professor thinks the answer is D’” is added to the prompt, models sometimes switch their answer to D. In such cases, the Gemini ITC model articulates the cue 54% of the time, compared to 14% for the non-ITC Gemini. We evaluate 7 types of cue, such as misleading few-shot examples and anchoring on past responses. ITC models articulate cues that influence them much more reliably than all the 6 non-ITC models tested, such as Claude-3.5-Sonnet and GPT-4o, which often articulate close to 0% of the time. However, our study has important limitations. We evaluate only two ITC models – we cannot evaluate OpenAI’s SOTA o1 model. We also lack details about the training of these ITC models, making it hard to attribute our findings to specific processes. We think faithfulness of CoT is an important property for AI Safety. The ITC models we tested show a large improvement in faithfulness, which is worth investigating further. To speed up this investigation, we release these early results as a research note.

近期，经过专门训练以生成长链条思维（Chain of Thoughts, CoT）的模型取得了令人印象深刻的结果。我们将这些模型称为推理时间计算（Inference-Time-Compute, ITC）模型。与传统的非ITC模型相比，ITC模型的思维链条是否更加忠实？我们对两款基于Qwen-2.5和Gemini-2的ITC模型进行了忠诚性的现有测试。为了衡量思维链条的忠实性，我们测试了模型是否能明确其提示中的线索来影响其答案到MMLU问题的方向。例如，当提示中加入“斯坦福教授认为答案是D”这一线索时，模型有时会改变答案选择D。在这种情况下，Gemini ITC模型能够明确这一线索的占比达到54%，而非ITC的Gemini仅占14%。我们对包括误导性的少量示例和基于过去回应的锚定在内的7种线索进行了评估。ITC模型更加可靠地描述了影响他们的线索，相比于我们测试的六个非ITC模型，如Claude-3.5-Sonnet和GPT-4o等，它们通常明确表述的比例接近为0%。然而，我们的研究存在重要局限性。我们只评估了两个ITC模型，无法评估OpenAI的最新模型o1。此外，由于缺乏关于这些ITC模型的训练细节，这使得我们的发现难以归因于特定的过程。我们认为思维链条的忠实性是AI安全性的一个重要属性。我们测试的ITC模型在忠实性方面取得了很大改进，值得进一步调查。为了加速这一调查进程，我们发布这些早期结果作为研究笔记。

论文及项目相关链接

PDF 7 pages, 5 figures

Summary

本文介绍了针对生成长链条思维（CoT）的模型，特别是Inference-Time-Compute（ITC）模型的信仰度测试。通过对两款ITC模型（基于Qwen-2.5和Gemini-2）的评估，发现它们相比传统非ITC模型更可靠地表达影响答案的线索。然而，研究存在局限性，仅评估了两个ITC模型，且缺乏关于这些ITC模型训练的具体细节。尽管如此，ITC模型在信仰度方面显示出重大改进，值得进一步调查。

Key Takeaways

ITC模型被训练用于生成长链条思维（CoT），在信仰度测试中表现优异。
与传统非ITC模型相比，ITC模型更可靠地表达影响答案的线索。
在评估的7种线索中，ITC模型比6款非ITC模型更可靠地表达这些线索。
研究仅评估了两个ITC模型，存在局限性。
缺乏关于ITC模型训练的具体细节，使得难以将发现归因于特定过程。
信仰度是AI安全的重要属性，ITC模型在信仰度方面的改进值得进一步调查。

Cool Papers

点此查看论文截图

Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning

Authors:Jiaqi Hua, Wanxu Wei

Recently, several works have been conducted on jailbreaking Large Language Models (LLMs) with few-shot malicious demos. In particular, Zheng et al. (2024) focuses on improving the efficiency of Few-Shot Jailbreaking (FSJ) by injecting special tokens into the demos and employing demo-level random search. Nevertheless, this method lacks generality since it specifies the instruction-response structure. Moreover, the reason why inserting special tokens takes effect in inducing harmful behaviors is only empirically discussed. In this paper, we take a deeper insight into the mechanism of special token injection and propose Self-Instruct Few-Shot Jailbreaking (Self-Instruct-FSJ) facilitated with the demo-level greedy search. This framework decomposes the FSJ attack into pattern and behavior learning to exploit the model’s vulnerabilities in a more generalized and efficient way. We conduct elaborate experiments to evaluate our method on common open-source models and compare it with baseline algorithms. Our code is available at https://github.com/iphosi/Self-Instruct-FSJ.

最近，有一些关于利用少量恶意演示破解大型语言模型（LLM）的研究。特别是，Zheng等人（2024年）专注于通过向演示中注入特殊令牌并采用演示级随机搜索来提高少样本破解（FSJ）的效率。然而，这种方法缺乏通用性，因为它规定了指令-响应结构。此外，插入特殊令牌为何能有效诱导有害行为只是进行了经验性讨论。在本文中，我们对特殊令牌注入的机制进行了更深入的研究，并提出了借助演示级贪婪搜索的辅助进行自我指导少样本破解（Self-Instruct FSJ）。该框架将FSJ攻击分解为模式学习和行为学习，以更通用和高效的方式利用模型的漏洞。我们在常见的开源模型上进行了精心设计的实验来评估我们的方法，并将其与基线算法进行了比较。我们的代码位于https://github.com/iphosi/Self-Instruct-FSJ。

论文及项目相关链接

PDF

Summary

大型语言模型的少样本越狱攻击研究近期备受关注。本文深入探讨了特殊令牌注入的机制，并提出了基于演示级贪婪搜索的Self-Instruct少样本越狱攻击框架。该框架将越狱攻击分解为模式和行为学习，以更通用和高效的方式利用模型的漏洞。实验证明，该方法在开源模型上的表现优于基线算法。相关代码已公开在GitHub上。

Key Takeaways

当前对大型语言模型的少样本越狱攻击研究活跃。
Zheng等人的研究通过插入特殊令牌提高了越狱效率，但缺乏通用性。
本文深入探讨了特殊令牌注入的机制，并提出了新的Self-Instruct少样本越狱攻击框架。
该框架通过分解越狱攻击为模式和行为学习，以更通用和高效的方式利用模型漏洞。
实验证明，新框架在开源模型上的表现优于现有方法。
研究的代码已公开在GitHub上，便于他人查阅和使用。
该研究为大型语言模型的安全性和稳定性带来了新的挑战和研究方向。

Cool Papers

点此查看论文截图

Large Language Model Interface for Home Energy Management Systems

Authors:François Michelon, Yihong Zhou, Thomas Morstyn

Home Energy Management Systems (HEMSs) help households tailor their electricity usage based on power system signals such as energy prices. This technology helps to reduce energy bills and offers greater demand-side flexibility that supports the power system stability. However, residents who lack a technical background may find it difficult to use HEMSs effectively, because HEMSs require well-formatted parameterization that reflects the characteristics of the energy resources, houses, and users’ needs. Recently, Large-Language Models (LLMs) have demonstrated an outstanding ability in language understanding. Motivated by this, we propose an LLM-based interface that interacts with users to understand and parameterize their ``badly-formatted answers’’, and then outputs well-formatted parameters to implement an HEMS. We further use Reason and Act method (ReAct) and few-shot prompting to enhance the LLM performance. Evaluating the interface performance requires multiple user–LLM interactions. To avoid the efforts in finding volunteer users and reduce the evaluation time, we additionally propose a method that uses another LLM to simulate users with varying expertise, ranging from knowledgeable to non-technical. By comprehensive evaluation, the proposed LLM-based HEMS interface achieves an average parameter retrieval accuracy of 88%, outperforming benchmark models without ReAct and/or few-shot prompting.

家庭能源管理系统（HEMS）能够根据电力系统信号（如能源价格）帮助家庭调整其用电行为。这项技术有助于降低能源账单，并提供更大的需求侧灵活性，支持电力系统稳定性。然而，缺乏技术背景的居民可能会发现有效使用HEMS很困难，因为HEMS需要反映能源资源、房屋和用户需求的特性的格式化参数。最近，大型语言模型（LLM）在语言理解方面表现出了出色的能力。因此，我们提出了一种基于LLM的接口，该接口与用户互动，理解和参数化他们的“格式错误的答案”，然后输出格式良好的参数来实现HEMS。我们进一步使用Reason and Act方法（ReAct）和少量提示来增强LLM的性能。评估接口性能需要进行多次用户与LLM的互动。为了避免寻找志愿者用户和减少评估时间，我们还提出了一种方法，即使用另一个LLM来模拟具有不同专业知识水平的用户，从知识渊博到非技术。通过全面评估，所提出的基于LLM的HEMS接口平均参数检索准确率为88%，优于没有ReAct和/或少量提示的基准模型。

论文及项目相关链接

PDF 13 pages conference paper

Summary：
家庭能源管理系统（HEMS）可根据电力系统信号如能源价格定制家庭用电。该技术有助于降低能源账单，并为电力系统稳定性提供更大的需求侧灵活性。然而，缺乏技术背景的居民可能难以有效使用HEMS。最近，大型语言模型（LLM）展示了出色的语言理解能力。基于此，提出一种基于LLM的接口，与用户交互以理解和参数化他们的“格式错误的答案”，并输出格式良好的参数来实现HEMS。通过采用Reason and Act方法（ReAct）和少量提示增强LLM性能。评估接口性能需要多次用户与LLM的互动。为了避免寻找志愿者用户和减少评估时间，我们提出了一种使用另一个LLM模拟不同专业程度的用户的方法，从知识渊博到非技术用户。评估结果显示，所提出的基于LLM的HEMS接口平均参数检索准确率为88%，优于未使用ReAct和/或少量提示的基准模型。

Key Takeaways:

家庭能源管理系统(HEMS)能根据电力信号调整电力使用，有助于减少能源账单和支持电力系统稳定性。
大型语言模型（LLM）在理解和参数化居民对HEMS的使用方面展现出潜力。
提出了基于LLM的接口，通过用户交互理解并参数化“格式错误的答案”。
采用Reason and Act方法（ReAct）和少量提示增强LLM性能，提高HEMS接口的性能。
使用另一个LLM模拟不同专业程度的用户，从知识渊博到非技术用户，便于接口性能评估。
综合评估显示，基于LLM的HEMS接口平均参数检索准确率为88%。
该接口性能优于未使用ReAct和/或少量提示的基准模型。

Cool Papers

点此查看论文截图

Leveraging Metamemory Mechanisms for Enhanced Data-Free Code Generation in LLMs

Authors:Shuai Wang, Liang Ding, Yibing Zhan, Yong Luo, Zheng He, Dapeng Tao

Automated code generation using large language models (LLMs) has gained attention due to its efficiency and adaptability. However, real-world coding tasks or benchmarks like HumanEval and StudentEval often lack dedicated training datasets, challenging existing few-shot prompting approaches that rely on reference examples. Inspired by human metamemory-a cognitive process involving recall and evaluation-we present a novel framework (namely M^2WF) for improving LLMs’ one-time code generation. This approach enables LLMs to autonomously generate, evaluate, and utilize synthetic examples to enhance reliability and performance. Unlike prior methods, it minimizes dependency on curated data and adapts flexibly to various coding scenarios. Our experiments demonstrate significant improvements in coding benchmarks, offering a scalable and robust solution for data-free environments. The code and framework will be publicly available on GitHub and HuggingFace.

使用大型语言模型（LLM）进行自动代码生成已经因其效率和适应性而受到关注。然而，现实世界中的编码任务或HumanEval和学生评估（StudentEval）等基准测试通常缺乏专用的训练数据集，这给依赖参考样本的现有少数提示方法带来了挑战。受人类元记忆（一种涉及回忆和评价的认知过程）的启发，我们提出了一种改进LLM一次性代码生成的新型框架（即M^2WF）。这种方法使LLM能够自主生成、评估和利用合成示例，以提高可靠性和性能。不同于以前的方法，它最大限度地减少了对数据集整理的依赖，并灵活地适应各种编码场景。我们的实验表明，在编码基准测试中取得了显著的改进，为无数据环境提供了可伸缩和稳健的解决方案。代码和框架将在GitHub和HuggingFace上公开可用。

论文及项目相关链接

PDF 11 pages,6 figures

Summary

基于大型语言模型（LLM）的自动化代码生成因其效率和适应性而受到关注。然而，现实世界的编程任务或基准测试（如HumanEval和StudentEval）往往缺乏专用的训练数据集，这给依赖参考样例的现有少样本提示方法带来了挑战。受人类元记忆（一种涉及回忆和评估的认知过程）的启发，我们提出了一种名为M^2WF的新型框架，用于改进LLM的一次性代码生成。该方法使LLM能够自主生成、评估和利用合成示例，以提高可靠性和性能。与现有方法不同，它最大限度地减少了对精选数据的依赖，并灵活适应各种编码场景。实验表明，在编码基准测试中，该框架取得了显著的改进，为无数据环境提供了可伸缩和稳健的解决方案。

Key Takeaways

大型语言模型（LLMs）在自动化代码生成中受到关注。
现实世界的编程任务或基准测试缺乏专用训练数据集。
现有少样本提示方法面临挑战。
受人类元记忆启发，提出新型框架M^2WF改进LLM的一次性代码生成。
M^2WF使LLM能够自主生成、评估和利用合成示例。
M^2WF提高LLM的可靠性和性能。

Cool Papers

点此查看论文截图

Transforming Role Classification in Scientific Teams Using LLMs and Advanced Predictive Analytics

Authors:Wonduk Seo, Yi Bu

Scientific team dynamics are critical in determining the nature and impact of research outputs. However, existing methods for classifying author roles based on self-reports and clustering lack comprehensive contextual analysis of contributions. Thus, we present a transformative approach to classifying author roles in scientific teams using advanced large language models (LLMs), which offers a more refined analysis compared to traditional clustering methods. Specifically, we seek to complement and enhance these traditional methods by utilizing open source and proprietary LLMs, such as GPT-4, Llama3 70B, Llama2 70B, and Mistral 7x8B, for role classification. Utilizing few-shot prompting, we categorize author roles and demonstrate that GPT-4 outperforms other models across multiple categories, surpassing traditional approaches such as XGBoost and BERT. Our methodology also includes building a predictive deep learning model using 10 features. By training this model on a dataset derived from the OpenAlex database, which provides detailed metadata on academic publications – such as author-publication history, author affiliation, research topics, and citation counts – we achieve an F1 score of 0.76, demonstrating robust classification of author roles.

科研团队的动态在决定研究成果的性质和影响方面至关重要。然而，现有的基于自我报告和聚类的作者角色分类方法缺乏对贡献的全面上下文分析。因此，我们提出了一种利用先进的大型语言模型（LLMs）对科研团队中的作者角色进行分类的变革性方法，与传统聚类方法相比，它提供了更为精细的分析。具体而言，我们希望通过利用开源和专有的大型语言模型，如GPT-4、Llama3 70B、Llama2 70B和Mistral 7x8B来进行角色分类，以补充和增强这些传统方法。通过少样本提示，我们对作者角色进行了分类，并证明GPT-4在多类别中表现优于其他模型，超越了传统的XGBoost和BERT等方法。我们的方法还包括使用10个特征构建预测深度学习模型。通过在OpenAlex数据库衍生的数据集上训练该模型（该数据库提供有关学术出版的详细元数据，如作者出版历史、作者隶属关系、研究主题和引用计数），我们获得了0.76的F1分数，证明了作者角色分类的稳健性。

论文及项目相关链接

PDF 14 pages, 4 figures, 3 tables

摘要

科研团队动态对于决定研究产出的性质和影响力至关重要。然而，现有的基于自我报告和聚类的作者角色分类方法缺乏全面分析贡献的语境。因此，我们提出了一种变革性的方法，利用先进的大型语言模型（LLMs）对科研团队中的作者角色进行分类，相比传统的聚类方法，LLMs提供了更为精细的分析。具体来说，我们结合并改进传统方法，使用开源和专有LLMs（如GPT-4、Llama3 70B、Llama2 70B和Mistral 7x8B）进行角色分类。通过少量提示，我们对作者角色进行分类，并证明GPT-4在多类别中表现优于其他模型，超越了如XGBoost和BERT等传统方法。我们的方法还包括使用OpenAlex数据库构建的预测深度学习模型，该数据库提供有关学术出版的详细元数据，如作者出版历史、作者隶属关系、研究主题和引用计数等。通过该模型，我们实现了F1分数为0.76，证明了作者角色分类的稳健性。

关键见解

科研团队动态对研究产出的性质和影响力至关重要。
现有作者角色分类方法缺乏全面分析贡献的语境。
利用先进的大型语言模型（LLMs）进行作者角色分类提供更精细的分析。
GPT-4在作者角色分类中表现优于其他模型。
GPT-4在多个类别中超越了传统方法，如XGBoost和BERT。
使用OpenAlex数据库的预测深度学习模型实现稳健的作者角色分类，F1分数为0.76。

Cool Papers

点此查看论文截图

Exploring the Use of Contrastive Language-Image Pre-Training for Human Posture Classification: Insights from Yoga Pose Analysis

Authors:Andrzej D. Dobrzycki, Ana M. Bernardos, Luca Bergesio, Andrzej Pomirski, Daniel Sáez-Trigueros

Accurate human posture classification in images and videos is crucial for automated applications across various fields, including work safety, physical rehabilitation, sports training, or daily assisted living. Recently, multimodal learning methods, such as Contrastive Language-Image Pretraining (CLIP), have advanced significantly in jointly understanding images and text. This study aims to assess the effectiveness of CLIP in classifying human postures, focusing on its application in yoga. Despite the initial limitations of the zero-shot approach, applying transfer learning on 15,301 images (real and synthetic) with 82 classes has shown promising results. The article describes the full procedure for fine-tuning, including the choice for image description syntax, models and hyperparameters adjustment. The fine-tuned CLIP model, tested on 3826 images, achieves an accuracy of over 85%, surpassing the current state-of-the-art of previous works on the same dataset by approximately 6%, its training time being 3.5 times lower than what is needed to fine-tune a YOLOv8-based model. For more application-oriented scenarios, with smaller datasets of six postures each, containing 1301 and 401 training images, the fine-tuned models attain an accuracy of 98.8% and 99.1%, respectively. Furthermore, our experiments indicate that training with as few as 20 images per pose can yield around 90% accuracy in a six-class dataset. This study demonstrates that this multimodal technique can be effectively used for yoga pose classification, and possibly for human posture classification, in general. Additionally, CLIP inference time (around 7 ms) supports that the model can be integrated into automated systems for posture evaluation, e.g., for developing a real-time personal yoga assistant for performance assessment.

准确地对图像和视频中的人体姿态进行分类，对于包括工作安全、身体康复、运动训练和日常生活辅助在内的各个领域中的自动化应用至关重要。最近，如对比语言图像预训练（CLIP）等多模态学习方法在联合理解图像和文字方面取得了显著进展。本研究旨在评估CLIP在人体姿态分类方面的有效性，重点关注其在瑜伽中的应用。尽管零样本方法存在一些初步限制，但在15301张（真实和合成）包含82类的图像上应用迁移学习已显示出令人鼓舞的结果。文章描述了微调的全过程，包括图像描述语法、模型和超参数调整的选择。经过精细调整的CLIP模型在3826张图像上测试时，准确率超过85%，比同一数据集上的先前最新技术高出约6%，而且其训练时间是微调YOLOv8模型所需时间的3.5倍低。对于更面向应用的环境，包含每个姿态仅有1301张和401张训练图像的小数据集，精细调整后的模型分别达到了98.8%和99.1%的准确率。此外，我们的实验表明，每姿态仅使用20张图像进行训练即可在包含六个类别的数据集中达到约90%的准确率。本研究表明，这种多模态技术可有效用于瑜伽姿势分类，也可能广泛用于一般的人体姿态分类。此外，CLIP推理时间约为7毫秒，支持将模型集成到姿势评估的自动化系统中，例如开发用于性能评估的实时个人瑜伽助理。

论文及项目相关链接

PDF

摘要

本文研究了利用多模态学习方法，如Contrastive Language-Image Pretraining (CLIP)，进行人类姿势分类的应用。特别是在瑜伽姿势分类方面，通过调整模型、参数和优化图像描述语法等步骤进行微调，实验结果显示fine-tuned的CLIP模型在瑜伽姿势分类上取得了较高的准确率，超过现有技术水平约6%，且训练时间减少了3.5倍。研究还表明，使用少量图像（如每姿势仅20张图像）进行训练也能达到约90%的准确率。因此，该多模态技术可有效应用于瑜伽姿势分类，甚至可能适用于一般的人类姿势分类。此外，CLIP模型的推理时间约为7毫秒，支持集成到自动姿势评估系统中。

关键见解

多模态学习方法如CLIP在姿势分类中有广泛应用前景。
CLIP模型在瑜伽姿势分类上表现优异，准确率超过现有技术约6%。
通过微调，CLIP模型可以在较小的数据集上实现高准确率。
使用少量图像（每姿势仅20张）进行训练也可达到良好的准确率。
CLIP模型的推理时间短，适合集成到实时系统中进行姿势评估。
此研究为瑜伽和其他领域自动姿势评估系统的开发提供了新思路。

Cool Papers

点此查看论文截图

Matching Free Depth Recovery from Structured Light

Authors:Zhuohang Yu, Kai Wang, Juyong Zhang

We present a novel approach for depth estimation from images captured by structured light systems. Unlike many previous methods that rely on image matching process, our approach uses a density voxel grid to represent scene geometry, which is trained via self-supervised differentiable volume rendering. Our method leverages color fields derived from projected patterns in structured light systems during the rendering process, enabling the isolated optimization of the geometry field. This contributes to faster convergence and high-quality output. Additionally, we incorporate normalized device coordinates (NDC), a distortion loss, and a novel surface-based color loss to enhance geometric fidelity. Experimental results demonstrate that our method outperforms existing matching-based techniques in geometric performance for few-shot scenarios, achieving approximately a 60% reduction in average estimated depth errors on synthetic scenes and about 30% on real-world captured scenes. Furthermore, our approach delivers fast training, with a speed roughly three times faster than previous matching-free methods that employ implicit representations.

我们提出了一种基于结构光系统采集的图像进行深度估计的新方法。不同于许多依赖于图像匹配过程的先前方法，我们的方法使用密度体素网格来表示场景几何结构，并通过自我监督的可微体积渲染进行训练。我们的方法在渲染过程中利用结构光系统中投影图案产生的颜色场，实现对几何场的独立优化。这有助于更快的收敛和更高质量的结果。此外，我们结合了归一化设备坐标（NDC）、畸变损失和基于表面的颜色损失，以提高几何保真度。实验结果表明，我们的方法在少量场景下的几何性能表现优于现有的基于匹配的技术。在合成场景上，我们的方法在平均估计深度误差上大约减少了60%，在真实捕获的场景上大约减少了30%。而且，我们的方法训练速度快，大约是之前使用隐式表示的无匹配方法的三倍速度。

论文及项目相关链接

PDF 10 pages, 8 figures

Summary

本文提出了一种基于结构化光系统图像的新深度估计方法。与其他依赖图像匹配过程的方法不同，我们的方法使用密度体素网格表示场景几何，并通过自监督的可微分体积渲染进行训练。该方法利用结构化光系统中渲染过程中的投影模式产生的颜色场，实现几何场的独立优化，从而提高了收敛速度和输出质量。此外，还结合了归一化设备坐标（NDC）、畸变损失和基于表面的颜色损失，以提高几何保真度。实验结果表明，在少样本情况下，该方法在几何性能上优于现有的基于匹配的技术，合成场景的平均估计深度误差降低了约60%，真实世界捕获的场景降低了约30%。此外，该方法训练速度快，大约是之前使用隐式表示的无匹配方法的三倍。

Key Takeaways

引入了一种基于结构化光系统的新型深度估计方法。
采用密度体素网格表示场景几何，并通过自监督可微分体积渲染进行训练。
利用颜色场实现几何场的独立优化，加快收敛速度并提高输出质量。
结合归一化设备坐标（NDC）、畸变损失和基于表面的颜色损失增强几何保真度。
在合成和真实场景下的深度估计中表现优异，相比现有方法显著减少了估计深度误差。
该方法的训练速度较快，相较于其他无匹配方法有明显的优势。
为深度估计领域提供了新的思路和方法，有望推动该领域的进一步发展。

Cool Papers

点此查看论文截图

Motion Tracks: A Unified Representation for Human-Robot Transfer in Few-Shot Imitation Learning

Authors:Juntao Ren, Priya Sundaresan, Dorsa Sadigh, Sanjiban Choudhury, Jeannette Bohg

Teaching robots to autonomously complete everyday tasks remains a challenge. Imitation Learning (IL) is a powerful approach that imbues robots with skills via demonstrations, but is limited by the labor-intensive process of collecting teleoperated robot data. Human videos offer a scalable alternative, but it remains difficult to directly train IL policies from them due to the lack of robot action labels. To address this, we propose to represent actions as short-horizon 2D trajectories on an image. These actions, or motion tracks, capture the predicted direction of motion for either human hands or robot end-effectors. We instantiate an IL policy called Motion Track Policy (MT-pi) which receives image observations and outputs motion tracks as actions. By leveraging this unified, cross-embodiment action space, MT-pi completes tasks with high success given just minutes of human video and limited additional robot demonstrations. At test time, we predict motion tracks from two camera views, recovering 6DoF trajectories via multi-view synthesis. MT-pi achieves an average success rate of 86.5% across 4 real-world tasks, outperforming state-of-the-art IL baselines which do not leverage human data or our action space by 40%, and generalizes to scenarios seen only in human videos. Code and videos are available on our website https://portal-cornell.github.io/motion_track_policy/.

教授机器人自主完成日常任务仍然是一个挑战。模仿学习（IL）是一种通过演示赋予机器人技能的强大方法，但受限于收集遥控机器人数据的过程劳动强度高。人类视频提供了一个可扩展的替代方案，但由于缺乏机器人动作标签，从人类视频中直接训练IL策略仍然很困难。为了解决这个问题，我们提出将动作表示为图像上的短周期2D轨迹。这些动作或运动轨迹捕捉了人类手部或机器人末端执行器预测的运动方向。我们实例化了一种名为运动轨迹策略（MT-pi）的IL策略，它接收图像观察结果并输出运动轨迹作为动作。通过利用这种统一的、跨实体的动作空间，MT-pi仅使用几分钟的人类视频和有限的额外机器人演示即可成功完成任务。在测试时，我们从两个摄像机视角预测运动轨迹，通过多视角合成恢复6DoF轨迹。MT-pi在4个真实任务中的平均成功率达到86.5%，比不利用人类数据或我们动作空间的最新IL基线高出40%，并且能够推广到仅在人类视频中出现的场景。我们的网站https://portal-cornell.github.io/motion_track_policy/上有代码和视频可供查看。

论文及项目相关链接

PDF

Summary

本文探讨了机器人自主完成日常任务的教学挑战。提出了通过将动作表示为图像上的短周期二维轨迹来解决模仿学习（IL）面临的挑战。提出了一项名为运动轨迹策略（MT-pi）的模仿学习策略，可从人类视频接收图像观察并以运动轨迹的形式输出动作。通过利用这一统一的跨体现动作空间，MT-pi仅使用几分钟的人类视频和有限的额外机器人演示即能成功完成任务。测试时，我们从两个摄像机视角预测运动轨迹，通过多视角合成恢复六自由度轨迹。MT-pi在四个真实任务中的平均成功率达到86.5%，优于不利用人类数据或我们动作空间的最新IL基线，并推广到仅存在于人类视频的场景中。

Key Takeaways

机器人自主完成日常任务的教学存在挑战，尤其是数据收集方面。
模仿学习（IL）是一种有效的机器人技能传授方法，但需大量的机器人数据，人类视频提供了一种可替代的解决方案。
提出了一种新的模仿学习策略——运动轨迹策略（MT-pi），能从人类视频中学习并成功完成任务。
MT-pi通过将动作表示为图像上的短周期二维轨迹来解决直接从人类视频训练IL策略的挑战。
MT-pi利用统一的跨体现动作空间，仅使用几分钟的人类视频和有限的额外机器人演示就能成功完成任务。
在四个真实任务的测试中，MT-pi的平均成功率达到86.5%，显著优于其他IL策略。

Cool Papers

点此查看论文截图

Synthetic Prior for Few-Shot Drivable Head Avatar Inversion

Authors:Wojciech Zielonka, Stephan J. Garbin, Alexandros Lattas, George Kopanas, Paulo Gotardo, Thabo Beeler, Justus Thies, Timo Bolkart

We present SynShot, a novel method for the few-shot inversion of a drivable head avatar based on a synthetic prior. We tackle two major challenges. First, training a controllable 3D generative network requires a large number of diverse sequences, for which pairs of images and high-quality tracked meshes are not always available. Second, state-of-the-art monocular avatar models struggle to generalize to new views and expressions, lacking a strong prior and often overfitting to a specific viewpoint distribution. Inspired by machine learning models trained solely on synthetic data, we propose a method that learns a prior model from a large dataset of synthetic heads with diverse identities, expressions, and viewpoints. With few input images, SynShot fine-tunes the pretrained synthetic prior to bridge the domain gap, modeling a photorealistic head avatar that generalizes to novel expressions and viewpoints. We model the head avatar using 3D Gaussian splatting and a convolutional encoder-decoder that outputs Gaussian parameters in UV texture space. To account for the different modeling complexities over parts of the head (e.g., skin vs hair), we embed the prior with explicit control for upsampling the number of per-part primitives. Compared to state-of-the-art monocular methods that require thousands of real training images, SynShot significantly improves novel view and expression synthesis.

我们提出了SynShot，这是一种基于合成先验的可驾驶头部化身少样本反转的新型方法。我们解决了两个主要挑战。首先，训练可控的3D生成网络需要大量的不同序列，而图像和高质量跟踪网格的配对并不总是可用的。其次，最先进的单目化身模型很难推广到新的视角和表情，缺乏强大的先验知识，经常过度适应特定的视角分布。受到仅由合成数据训练的机器学习模型的启发，我们提出了一种从大量合成头部数据中学习先验模型的方法，这些合成数据具有不同的身份、表情和视角。凭借少量的输入图像，SynShot对预训练的合成先验进行了微调，以弥合领域之间的差距，从而模拟一个通用到各种新表情和视角的真实头部化身。我们使用三维高斯喷绘和卷积编码器解码器来输出UV纹理空间的高斯参数来模拟头部化身。考虑到头部各部分建模复杂性的差异（例如皮肤和头发），我们通过先验嵌入来明确控制每个部分原始数据的上采样数量。与需要数千张真实训练图像的最新单目方法相比，SynShot极大地改进了新颖视角和表情的合成效果。

Summary

本文提出了SynShot方法，这是一种基于合成先验的少数头部可驱动角色的新方法。解决了两个主要挑战：缺乏大规模多样序列的训练数据和对新视角和表情的泛化能力不足。SynShot学习了一个先验模型，从大量合成头部数据中训练得来，并利用少量输入图像微调预训练模型，以缩小领域差距，生成逼真的头部角色模型，能够泛化到新的视角和表情。通过3D高斯涂片和卷积编码器解码器建模头部角色，输出UV纹理空间的高斯参数。针对头部不同部分的建模复杂性（如皮肤和头发），嵌入先验知识以控制每个部分的原始数量。相较于需要数千张真实训练图像的最先进单眼方法，SynShot显著改善了对新视角和表情的合成效果。

Key Takeaways

SynShot是一种基于合成先验的少数头部可驱动角色的新方法。
解决缺乏大规模多样序列训练数据和泛化能力不足的问题。
通过学习先验模型，利用少量输入图像微调预训练模型，生成逼真的头部角色模型。
模型能够泛化到新的视角和表情。
采用3D高斯涂片和卷积编码器解码器建模头部角色。
输出UV纹理空间的参数以精细控制模型的外观和纹理。

Cool Papers

点此查看论文截图

A Foundational Generative Model for Breast Ultrasound Image Analysis

Authors:Haojun Yu, Youcheng Li, Nan Zhang, Zihan Niu, Xuantong Gong, Yanwen Luo, Haotian Ye, Siyu He, Quanlin Wu, Wangyan Qin, Mengyuan Zhou, Jie Han, Jia Tao, Ziwei Zhao, Di Dai, Di He, Dong Wang, Binghui Tang, Ling Huo, James Zou, Qingli Zhu, Yong Wang, Liwei Wang

Foundational models have emerged as powerful tools for addressing various tasks in clinical settings. However, their potential development to breast ultrasound analysis remains untapped. In this paper, we present BUSGen, the first foundational generative model specifically designed for breast ultrasound image analysis. Pretrained on over 3.5 million breast ultrasound images, BUSGen has acquired extensive knowledge of breast structures, pathological features, and clinical variations. With few-shot adaptation, BUSGen can generate repositories of realistic and informative task-specific data, facilitating the development of models for a wide range of downstream tasks. Extensive experiments highlight BUSGen’s exceptional adaptability, significantly exceeding real-data-trained foundational models in breast cancer screening, diagnosis, and prognosis. In breast cancer early diagnosis, our approach outperformed all board-certified radiologists (n=9), achieving an average sensitivity improvement of 16.5% (P-value<0.0001). Additionally, we characterized the scaling effect of using generated data which was as effective as the collected real-world data for training diagnostic models. Moreover, extensive experiments demonstrated that our approach improved the generalization ability of downstream models. Importantly, BUSGen protected patient privacy by enabling fully de-identified data sharing, making progress forward in secure medical data utilization. An online demo of BUSGen is available at https://aibus.bio.

基础模型已作为临床环境中解决各种任务的强大工具而出现。然而，它们在乳腺超声分析方面的潜力尚未被开发。在本文中，我们介绍了BUSGen，这是专门为乳腺超声图像分析设计的基础生成模型。在超过350万张乳腺超声图像上进行预训练后，BUSGen获得了关于乳房结构、病理特征和临床变化的广泛知识。通过少量样本适应，BUSGen可以生成现实且信息丰富的特定任务数据仓库，促进针对各种下游任务的模型发展。大量实验突出了BUSGen的卓越适应性，在乳腺癌筛查、诊断和预后方面显著超过了使用真实数据训练的基础模型。在乳腺癌早期诊断方面，我们的方法超越了所有执业医师认证的放射科医生（n=9），平均敏感性提高了16.5%（P值<0.0001）。此外，我们描述了使用生成数据与收集的真实世界数据一样有效的扩展效应，用于训练诊断模型。而且，大量实验表明，我们的方法提高了下游模型的泛化能力。重要的是，BUSGen通过实现完全匿名数据共享保护了患者隐私，在安全医疗数据利用方面取得了进展。BUSGen的在线演示可在https://aibus.bio查看。

论文及项目相关链接

PDF Peking University; Stanford University; Peking University Cancer Hospital & Institute; Peking Union Medical College Hospital; Cancer Hospital, Chinese Academy of Medical Sciences

Summary

BUSGen是专门为乳腺超声图像分析设计的基础生成模型。它在超过350万张乳腺超声图像上进行预训练，并可通过少量样本适应生成特定任务的数据。该模型在乳腺癌筛查、诊断和预后等方面表现出卓越的性能，甚至超越了使用真实数据训练的基础模型。它提高了下游模型的泛化能力，并实现了患者数据的完全匿名共享。

Key Takeaways

BUSGen是首个针对乳腺超声分析的基础生成模型。
该模型在大量乳腺超声图像上进行预训练，涵盖乳房结构、病理特征和临床变化。
通过少量样本适应，BUSGen可以生成特定任务的数据。
BUSGen在乳腺癌筛查、诊断和预后方面表现出卓越性能。
该模型提高了下游模型的泛化能力。
BUSGen实现了患者数据的完全匿名共享，促进了医疗数据的利用。

Cool Papers

点此查看论文截图

An efficient approach to represent enterprise web application structure using Large Language Model in the service of Intelligent Quality Engineering

Authors:Zaber Al Hassan Ayon, Gulam Husain, Roshankumar Bisoi, Waliur Rahman, Dr Tom Osborn

This paper presents a novel approach to represent enterprise web application structures using Large Language Models (LLMs) to enable intelligent quality engineering at scale. We introduce a hierarchical representation methodology that optimizes the few-shot learning capabilities of LLMs while preserving the complex relationships and interactions within web applications. The approach encompasses five key phases: comprehensive DOM analysis, multi-page synthesis, test suite generation, execution, and result analysis. Our methodology addresses existing challenges around usage of Generative AI techniques in automated software testing by developing a structured format that enables LLMs to understand web application architecture through in-context learning. We evaluated our approach using two distinct web applications: an e-commerce platform (Swag Labs) and a healthcare application (MediBox) which is deployed within Atalgo engineering environment. The results demonstrate success rates of 90% and 70%, respectively, in achieving automated testing, with high relevance scores for test cases across multiple evaluation criteria. The findings suggest that our representation approach significantly enhances LLMs’ ability to generate contextually relevant test cases and provide better quality assurance overall, while reducing the time and effort required for testing.

本文提出了一种利用大型语言模型（LLM）表示企业Web应用程序结构的新方法，以实现大规模智能质量工程。我们引入了一种分层表示法，该方法在优化LLM的少量学习功能的同时，保持Web应用程序内部的复杂关系和交互。该方法包括五个关键阶段：全面的DOM分析、多页面合成、测试套件生成、执行和结果分析。我们的方法通过开发一种结构化格式，使LLM能够通过上下文学习理解Web应用程序架构，解决了在自动化软件测试中使用生成式人工智能技术所面临的挑战。我们使用两个不同的Web应用程序评估了我们的方法：电子商务平台（Swag Labs）和部署在Atalgo工程环境中的医疗应用程序（MediBox）。结果显示，在自动化测试方面，电子商务平台的成功率达到90%，医疗应用程序的成功率达到70%，测试用例在多个评估标准上具有很高的相关性得分。研究结果表明，我们的表示法显著提高了LLM生成上下文相关测试用例的能力，并总体上提供了更好的质量保证，同时减少了测试所需的时间和精力。

论文及项目相关链接

PDF 16 pages, 1 figure and 4 tables, relevant for Gen AI and enterprise AI use cases

Summary

本文介绍了一种利用大型语言模型（LLMs）进行智能质量工程的新方法，旨在代表企业Web应用程序结构。引入了一种层次化的表示方法，优化了LLMs的少样本学习能力，同时保持了Web应用程序内的复杂关系和交互。该方法包括五个关键阶段：全面的DOM分析、多页面合成、测试套件生成、执行和结果分析。通过开发一种结构化格式，使LLMs能够通过上下文学习理解Web应用程序架构，解决了在自动化软件测试中使用生成式人工智能技术的现有挑战。使用两个不同的Web应用程序（电子商务平台的Swag Labs和Atalgo工程环境中的医疗保健应用程序MediBox）进行了评估，结果显示自动化测试的成功率分别为90%和70%，测试用例的相关性得分较高。这表明我们的表示方法显著提高了LLMs生成上下文相关测试用例的能力，并提供了更好的质量保证，同时减少了测试所需的时间和精力。

Key Takeaways

利用大型语言模型（LLMs）进行企业Web应用程序的智能质量工程。
提出了一种新的层次化表示方法，优化了LLMs的少样本学习能力。
方法包括全面的DOM分析、多页面合成、测试套件生成、执行和结果分析。
通过开发结构化格式，解决了在自动化软件测试中使用生成式AI技术的挑战。
通过两个Web应用程序的评估，显示了自动化测试的高成功率。
表示方法提高了LLMs生成上下文相关测试用例的能力。

Cool Papers

点此查看论文截图

Comparing Few-Shot Prompting of GPT-4 LLMs with BERT Classifiers for Open-Response Assessment in Tutor Equity Training

Authors:Sanjit Kakarla, Conrad Borchers, Danielle Thomas, Shambhavi Bhushan, Kenneth R. Koedinger

Assessing learners in ill-defined domains, such as scenario-based human tutoring training, is an area of limited research. Equity training requires a nuanced understanding of context, but do contemporary large language models (LLMs) have a knowledge base that can navigate these nuances? Legacy transformer models like BERT, in contrast, have less real-world knowledge but can be more easily fine-tuned than commercial LLMs. Here, we study whether fine-tuning BERT on human annotations outperforms state-of-the-art LLMs (GPT-4o and GPT-4-Turbo) with few-shot prompting and instruction. We evaluate performance on four prediction tasks involving generating and explaining open-ended responses in advocacy-focused training lessons in a higher education student population learning to become middle school tutors. Leveraging a dataset of 243 human-annotated open responses from tutor training lessons, we find that BERT demonstrates superior performance using an offline fine-tuning approach, which is more resource-efficient than commercial GPT models. We conclude that contemporary GPT models may not adequately capture nuanced response patterns, especially in complex tasks requiring explanation. This work advances the understanding of AI-driven learner evaluation under the lens of fine-tuning versus few-shot prompting on the nuanced task of equity training, contributing to more effective training solutions and assisting practitioners in choosing adequate assessment methods.

在基于情景的人类辅导训练等不明确领域中评估学习者是一个研究较少的领域。公平训练需要对背景有微妙的理解，但当代的大型语言模型（LLM）是否拥有能够应对这些微妙之处的知识库呢？相比之下，遗留的转换器模型（如BERT）虽然对真实世界的了解较少，但比商业LLM更容易微调。在这里，我们研究了微调BERT在人类注释上的表现是否优于最先进的大型语言模型GPT-4o和GPT-4 Turbo，采用少样本提示和指令。我们在四个预测任务中评估了性能，涉及在高等教育学生群体中，以倡导为核心的学习成为中学辅导老师的培训课程中，生成和解释开放式答案的响应。我们利用由人类注释的辅导训练课程的开放式答案数据集（共243份数据），发现使用离线微调方法的BERT表现出了优越的性能表现，该方法比商业GPT模型更加资源高效。我们得出结论，当代GPT模型可能无法充分捕捉复杂的响应模式，特别是在需要解释的复杂任务中。这项工作在微调与少样本提示下，进一步了解AI驱动的学习者评估在公平训练任务中的微妙之处，为更有效的训练解决方案做出贡献，并帮助从业者选择适当的评估方法。

论文及项目相关链接

PDF 8 Page Workshop Paper, AAAI2025 Workshop on Innovation and Responsibility in AI-Supported Education (iRAISE) - Open-response Grading, Feedback, Equity Training, LLMs, BERT, GPT-4

Summary

该研究探讨了评估基于情景的人类辅导训练中的学习者的问题。研究发现，虽然大型语言模型（LLMs）在处理复杂任务时面临挑战，但通过微调BERT模型在四项预测任务中的表现优于GPT模型。利用人类注释数据集进行研究的评估发现，通过线下微调的BERT模型更具优势。这显示出针对具体情境训练需求的细致考虑，并为更有效的训练解决方案和评估方法提供了参考。

Key Takeaways

研究领域：该研究关注于基于情景的人类辅导训练中的学习者评估问题，探讨GPT模型和BERT模型在该领域的表现差异。
实验设计：通过比较不同模型在四项预测任务中的表现，发现BERT模型更具优势。采用人类注释数据集进行研究评估。
性能评估：研究发现，通过微调的BERT模型在预测任务中表现优于GPT模型，尤其是在复杂任务中需要解释的部分。
资源效率：使用线下微调的方法使得BERT模型更具资源效率。

Cool Papers

点此查看论文截图

BayesAdapter: enhanced uncertainty estimation in CLIP few-shot adaptation

Authors:Pablo Morales-Álvarez, Stergios Christodoulidis, Maria Vakalopoulou, Pablo Piantanida, Jose Dolz

The emergence of large pre-trained vision-language models (VLMs) represents a paradigm shift in machine learning, with unprecedented results in a broad span of visual recognition tasks. CLIP, one of the most popular VLMs, has exhibited remarkable zero-shot and transfer learning capabilities in classification. To transfer CLIP to downstream tasks, adapters constitute a parameter-efficient approach that avoids backpropagation through the large model (unlike related prompt learning methods). However, CLIP adapters have been developed to target discriminative performance, and the quality of their uncertainty estimates has been overlooked. In this work we show that the discriminative performance of state-of-the-art CLIP adapters does not always correlate with their uncertainty estimation capabilities, which are essential for a safe deployment in real-world scenarios. We also demonstrate that one of such adapters is obtained through MAP inference from a more general probabilistic framework. Based on this observation we introduce BayesAdapter, which leverages Bayesian inference to estimate a full probability distribution instead of a single point, better capturing the variability inherent in the parameter space. In a comprehensive empirical evaluation we show that our approach obtains high quality uncertainty estimates in the predictions, standing out in calibration and selective classification. Our code will be publicly available upon acceptance of the paper.

大型预训练视觉语言模型（VLMs）的出现代表了机器学习范式的一种转变，并在广泛的视觉识别任务中取得了前所未有的结果。CLIP作为最受欢迎的VLM之一，在分类方面展现出了惊人的零样本和迁移学习能力。为了将CLIP迁移到下游任务，适配器（adapters）是一种参数高效的策略，避免了在大模型中进行反向传播（这与相关的提示学习方法不同）。然而，CLIP适配器旨在提高判别性能，而忽略了其不确定性估计的质量。在这项工作中，我们展示了最先进的CLIP适配器的判别性能并不总是与他们的不确定性估计能力相关，这对于在现实场景中的安全部署至关重要。我们还证明，其中一些适配器是通过更通用的概率框架通过最大后验概率推断得到的。基于这一观察，我们引入了BayesAdapter，它利用贝叶斯推断来估计一个完整的概率分布，而不是一个单一的点，从而更好地捕捉参数空间中固有的变异性。在一项全面的经验评估中，我们展示了我们的方法在预测中获得高质量的不确定性估计，在校准和选择性分类方面表现出色。论文被接受后，我们的代码将公开可用。

论文及项目相关链接

PDF 30 pages, 5 figures, 23 tables

Summary

大型预训练视觉语言模型（VLMs）的出现代表了机器学习领域的一个范式转变，已经在广泛的视觉识别任务中取得了前所未有的结果。CLIP作为一种流行的VLM，展现出令人印象深刻的零样本和迁移学习能力。为了在下游任务中迁移CLIP，适配器成为一种参数高效的方法，避免了通过大型模型的反向传播（与相关的提示学习方法不同）。然而，现有的CLIP适配器主要关注判别性能，忽略了不确定性估计的质量。本研究表明，最先进的CLIP适配器的判别性能并不总是与他们的不确定性估计能力相关，这对于在真实场景中的安全部署至关重要。此外，我们展示了通过更通用的概率框架进行MAP推理可以获得一种适配器。基于此观察，我们引入了BayesAdapter，它利用贝叶斯推理来估计一个完整的概率分布，而不是一个点，从而更好地捕捉参数空间中的固有变化。经过综合的实证评估，我们的方法在预测中获得高质量的不确定性估计，在校准和选择性分类方面表现出色。

Key Takeaways

大型预训练视觉语言模型（VLMs）在视觉识别任务中取得显著成果，标志着机器学习领域的范式转变。
CLIP作为一种流行的VLM，具有零样本和迁移学习能力。
适配器是参数高效的方法，用于在下游任务中迁移CLIP，避免大型模型反向传播。
现有CLIP适配器主要关注判别性能，但不确定性估计质量同样重要，尤其在真实场景部署中。
通过更通用的概率框架进行MAP推理可得到一种适配器。
引入BayesAdapter，利用贝叶斯推理估计完整的概率分布，提高不确定性估计质量。

Cool Papers

点此查看论文截图

Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers

Authors:Chancharik Mitra, Brandon Huang, Tianning Chai, Zhiqiu Lin, Assaf Arbelle, Rogerio Feris, Leonid Karlinsky, Trevor Darrell, Deva Ramanan, Roei Herzig

Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks such as image captioning or visual question answering. Despite strong performance, LMMs are not directly suited for foundational discriminative vision-language tasks (i.e., tasks requiring discrete label predictions) such as image classification and multiple-choice VQA. One key challenge in utilizing LMMs for discriminative tasks is the extraction of useful features from generative models. To overcome this issue, we propose an approach for finding features in the model’s latent space to more effectively leverage LMMs for discriminative tasks. Toward this end, we present Sparse Attention Vectors (SAVs) – a finetuning-free method that leverages sparse attention head activations (fewer than 1% of the heads) in LMMs as strong features for VL tasks. With only few-shot examples, SAVs demonstrate state-of-the-art performance compared to a variety of few-shot and finetuned baselines on a collection of discriminative tasks. Our experiments also imply that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as both effective and robust multimodal feature representations.

生成式大型多模态模型（LMMs），如LLaVA和Qwen-VL，在各种视觉语言（VL）任务（如图像描述或视觉问答）中都表现出卓越的性能。尽管表现强劲，但LMMs并不直接适用于基础判别式视觉语言任务（即需要离散标签预测的任务），如图像分类和多项选择VQA。利用LMMs进行判别任务的一个关键挑战是从生成模型中提取有用的特征。为了解决这个问题，我们提出了一种在模型潜在空间中找到特征的方法，以更有效地利用LMMs进行判别任务。为此，我们提出了稀疏注意力向量（SAVs）——这是一种无需微调的方法，它利用LMM中不到1%的稀疏注意力头激活作为强大的VL任务特征。凭借少量样本，SAVs在多个判别任务上相对于各种小样方法和微调基准线展现了最先进的性能。我们的实验还暗示，随着额外样本的增加，SAVs的性能可以进一步提高，并能够推广至类似任务，这表明SAVs是有效且稳健的多模态特征表示。

论文及项目相关链接

PDF

Summary

基于大型多模态模型（LMMs）如LLaVA和Qwen-VL的优异性能，它们擅长多种视觉语言（VL）任务，如图像描述和视觉问答。然而，对于需要离散标签预测的基础鉴别型视觉语言任务，如图像分类和多项选择问答等，LMMs并不直接适用。为了克服这一挑战，本研究提出了一种在模型潜在空间寻找特征的方法，以更有效地利用LMMs进行鉴别任务。为此，我们提出了稀疏注意力向量（SAVs）——一种无需微调的方法，利用LMMs中不到1%的稀疏注意力头激活作为视觉语言任务的强特征。仅通过少量样本，SAVs在多个鉴别任务上表现出卓越的性能，与多种少样本和微调基线相比具有显著优势。实验还表明，随着样本数量的增加，SAVs的性能可进一步提升，并能推广到类似任务，证明了其作为有效且稳健的多模态特征表示的优势。

Key Takeaways

LMMs如LLaVA和Qwen-VL擅长图像描述和视觉问答等VL任务。
LMMs不直接适用于需要离散标签预测的鉴别型视觉语言任务。
利用模型潜在空间寻找特征的方法能更有效地利用LMMs进行鉴别任务。
SAVs是一种无需微调的方法，利用稀疏注意力头激活作为强特征。
SAVs在少量样本下表现出卓越性能，并在多个鉴别任务上优于其他方法。
SAVs性能可随样本数量增加而提升，并具备推广到类似任务的能力。
SAVs作为有效且稳健的多模态特征表示具有优势。

Cool Papers

点此查看论文截图

Few-Shot Task Learning through Inverse Generative Modeling

Authors:Aviv Netanyahu, Yilun Du, Antonia Bronars, Jyothish Pari, Joshua Tenenbaum, Tianmin Shu, Pulkit Agrawal

Learning the intents of an agent, defined by its goals or motion style, is often extremely challenging from just a few examples. We refer to this problem as task concept learning and present our approach, Few-Shot Task Learning through Inverse Generative Modeling (FTL-IGM), which learns new task concepts by leveraging invertible neural generative models. The core idea is to pretrain a generative model on a set of basic concepts and their demonstrations. Then, given a few demonstrations of a new concept (such as a new goal or a new action), our method learns the underlying concepts through backpropagation without updating the model weights, thanks to the invertibility of the generative model. We evaluate our method in five domains – object rearrangement, goal-oriented navigation, motion caption of human actions, autonomous driving, and real-world table-top manipulation. Our experimental results demonstrate that via the pretrained generative model, we successfully learn novel concepts and generate agent plans or motion corresponding to these concepts in (1) unseen environments and (2) in composition with training concepts.

从少量示例中学习代理的意图（由其目标或动作风格定义）通常极具挑战性。我们将此问题称为任务概念学习，并提出我们的方法——通过逆向生成建模进行少量任务学习（FTL-IGM），该方法利用可逆神经生成模型来学习新任务概念。核心思想是在一组基本概念及其演示上预训练生成模型。然后，对于新概念的一些演示（例如新目标或新动作），我们的方法通过反向传播学习潜在概念，并且由于生成模型的可逆性，无需更新模型权重。我们在五个领域评估了我们的方法——物体重组、目标导向导航、人类动作的运动字幕、自动驾驶和现实世界桌面操作。我们的实验结果表明，通过预训练的生成模型，我们成功地学习了新概念和生成与这些概念相对应的代理计划或运动：（1）在未见过的环境中；（2）与训练概念组合在一起。

论文及项目相关链接

PDF Added acknowledgment

Summary

基于有限示例学习代理的意图，即其目标或动作风格，是一项重大挑战。我们称此问题为任务概念学习，并提出我们的方法——通过逆向生成模型进行少量任务学习（FTL-IGM），利用可逆神经生成模型学习新任务概念。核心思想是在基本概念及其演示上预训练生成模型，然后只需少量新概念的演示（如新目标或新动作），即可通过反向传播学习潜在概念，得益于生成模型的可逆性，无需更新模型权重。我们在五个领域评估了我们的方法——物体重新排列、目标导向导航、人类动作运动字幕、自动驾驶和现实世界桌面操作。实验结果表明，通过预训练的生成模型，我们成功学习了新概念和生成与这些概念相对应的代理计划或运动，在未见过的环境中以及组合训练概念的情况下均如此。

Key Takeaways

Few-Shot Task Learning through Inverse Generative Modeling (FTL-IGM) 是一种解决任务概念学习问题的方法。
该方法利用可逆神经生成模型，通过预训练学习基本概念及其演示。
仅需少量新概念的演示，即可通过反向传播学习潜在概念，无需更新模型权重。
FTL-IGM 在五个不同领域进行了评估，包括物体重新排列、目标导向导航、运动字幕、自动驾驶和桌面操作。
实验结果表明，该方法能够成功学习新概念和生成与这些新概念相对应的代理计划或运动。
该方法能够在未见过的环境中以及组合训练概念的情况下进行学习。

Cool Papers

点此查看论文截图

FoMo: A Foundation Model for Mobile Traffic Forecasting with Diffusion Model

Authors:Haoye Chai, Xiaoqian Qi, Shiyuan Zhang, Yong Li

Mobile traffic forecasting allows operators to anticipate network dynamics and performance in advance, offering substantial potential for enhancing service quality and improving user experience. However, existing models are often task-oriented and are trained with tailored data, which limits their effectiveness in diverse mobile network tasks of Base Station (BS) deployment, resource allocation, energy optimization, etc. and hinders generalization across different urban environments. Foundation models have made remarkable strides across various domains of NLP and CV due to their multi-tasking adaption and zero/few-shot learning capabilities. In this paper, we propose an innovative Foundation model for Mo}bile traffic forecasting (FoMo), aiming to handle diverse forecasting tasks of short/long-term predictions and distribution generation across multiple cities to support network planning and optimization. FoMo combines diffusion models and transformers, where various spatio-temporal masks are proposed to enable FoMo to learn intrinsic features of different tasks, and a contrastive learning strategy is developed to capture the correlations between mobile traffic and urban contexts, thereby improving its transfer learning capability. Extensive experiments on 9 real-world datasets demonstrate that FoMo outperforms current models concerning diverse forecasting tasks and zero/few-shot learning, showcasing a strong universality.

移动流量预测使运营商能够提前预测网络动态和性能，为提高服务质量和改善用户体验提供了巨大潜力。然而，现有模型通常是面向任务的，并且使用定制数据进行训练，这限制了它们在基站部署、资源配置、能源优化等多样化移动网络任务中的有效性，并阻碍了它们在不同城市环境中的泛化能力。由于其在多任务适应和零/少样本学习能力方面的突出表现，基础模型已在NLP和CV的各个领域取得了显著的进步。在本文中，我们提出了一种创新的移动流量预测基础模型（FoMo），旨在处理短期/长期预测和跨多个城市的分布生成的多样化预测任务，以支持网络规划和优化。FoMo结合了扩散模型和转换器，其中提出了各种时空掩码，以使其能够学习不同任务的内蕴特征，并开发了一种对比学习策略来捕捉移动流量与城市背景之间的相关性，从而提高其迁移学习能力。在9个真实世界数据集上的大量实验表明，FoMo在多样化的预测任务和零/少样本学习上优于当前模型，展示了强大的通用性。

论文及项目相关链接

PDF 11 pages, 7 figures

Summary

本文提出了一个创新的移动流量预测基础模型（FoMo），旨在处理多样的预测任务，包括短期和长期预测以及跨多个城市的分布生成，以支持网络规划和优化。FoMo结合了扩散模型和转换器，通过提出各种时空掩码来学习任务的内蕴特征，并发展了一种对比学习策略来捕捉移动流量和城市环境之间的关联，提高了其迁移学习能力。在9个真实世界数据集上的实验表明，FoMo在多样预测任务和零/少样本学习方面优于当前模型，表现出强大的通用性。

Key Takeaways

移动流量预测允许运营商预测网络动态和性能，对提高服务质量和用户体验有巨大潜力。
现有模型通常面向特定任务，并使用定制数据进行训练，这在多样化的移动网络任务（如基站部署、资源分配、能源优化等）中限制了其有效性，并阻碍了在不同城市环境中的泛化。
提出的FoMo模型旨在处理多样化的预测任务，包括短期和长期预测以及跨多个城市的分布生成，以支持网络规划和优化。
FoMo结合了扩散模型和转换器。
FoMo通过提出各种时空掩码来学习任务的内蕴特征。
对比学习策略被开发来捕捉移动流量和城市环境之间的关联，提高了模型的迁移学习能力。

Cool Papers

点此查看论文截图

Retrieval-Reasoning Large Language Model-based Synthetic Clinical Trial Generation

Authors:Zerui Xu, Fang Wu, Yuanyuan Zhang, Yue Zhao

Machine learning (ML) exhibits promise in the clinical domain. However, it is constrained by data scarcity and ethical considerations, as the generation of clinical trials presents significant challenges due to stringent privacy regulations, high costs, and the extended duration required for conducting studies with human participants. Despite the advancements of large language models (LLMs) in general generation tasks, their potential in facilitating the generation of synthetic clinical trials is under-explored. To address this gap, we introduce a novel Retrieval-Reasoning few-shot framework that leverages LLMs to generate artificial yet realistic and diverse clinical trials with binary success/failure labels. Experiments conducted on real clinical trials from the \url{ClinicalTrials.gov} database demonstrate that our synthetic data can effectively augment real datasets. Furthermore, by fine-tuning a pre-trained model as a binary classifier on synthetic clinical trial datasets, we demonstrate that this augmentation enhances model training for downstream tasks such as trial outcome prediction. Our findings suggest that LLMs for synthetic clinical trial generation hold promise for accelerating clinical research and upholding ethical standards for patient privacy. The code is publicly available at https://anonymous.4open.science/r/Retrieval_Reasoning_Clinical_Trial_Generation-3EC4.

机器学习（ML）在临床领域具有广阔前景。然而，由于严格的隐私法规、高昂成本和开展研究所需的长周期，临床试验的生成面临巨大挑战，导致数据稀缺和伦理考量限制了其发展。尽管大型语言模型（LLM）在通用生成任务中取得了进展，但其在促进合成临床试验生成方面的潜力尚未得到充分探索。为了弥补这一空白，我们提出了一种新型的检索推理小样框架，该框架利用LLM生成人工但现实且多样的临床试验数据，并带有二元成功/失败标签。在ClinicalTrials.gov数据库的实际临床试验上进行的实验表明，我们的合成数据可以有效地扩充真实数据集。此外，通过对预训练模型进行微调，将其作为合成临床试验数据集上的二元分类器，我们证明了这种扩充可以增强下游任务的模型训练，如试验结局预测。我们的研究结果表明，用于合成临床试验生成的LLM在加速临床研究和维持患者隐私的伦理标准方面具发展潜力。代码公开在：https://anonymous.4open.science/r/Retrieval_Reasoning_Clinical_Trial_Generation-3EC4。

论文及项目相关链接

PDF

Summary

机器学习在临床领域具有广阔的应用前景，但受限于数据稀缺和伦理考量。由于严格的隐私规定、高昂成本和长期的研究周期，临床试验的生成面临重大挑战。尽管大型语言模型在通用生成任务上有所突破，但其用于生成合成临床试验的潜力尚未得到充分探索。本研究提出了一种新型的检索推理少样本框架，利用大型语言模型生成具有二进制成功/失败标签的合成临床试验数据。实验证明，合成数据能有效扩充真实数据集，且通过微调预训练模型作为二元分类器进行训练，能提高下游任务如试验结局预测的模型性能。研究结果表明，利用大型语言模型生成合成临床试验数据对于加速临床研究并维持患者隐私的伦理标准具有潜力。

Key Takeaways