LLM

发布日期: 2024-12-27

更新日期: 2024-12-27

文章字数: 7.4k

阅读时长: 30 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2024-12-27 更新

Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation

Authors:Shaofei Huang, Rui Ling, Hongyu Li, Tianrui Hui, Zongheng Tang, Xiaoming Wei, Jizhong Han, Si Liu

In this paper, we propose an Audio-Language-Referenced SAM 2 (AL-Ref-SAM 2) pipeline to explore the training-free paradigm for audio and language-referenced video object segmentation, namely AVS and RVOS tasks. The intuitive solution leverages GroundingDINO to identify the target object from a single frame and SAM 2 to segment the identified object throughout the video, which is less robust to spatiotemporal variations due to a lack of video context exploration. Thus, in our AL-Ref-SAM 2 pipeline, we propose a novel GPT-assisted Pivot Selection (GPT-PS) module to instruct GPT-4 to perform two-step temporal-spatial reasoning for sequentially selecting pivot frames and pivot boxes, thereby providing SAM 2 with a high-quality initial object prompt. Within GPT-PS, two task-specific Chain-of-Thought prompts are designed to unleash GPT’s temporal-spatial reasoning capacity by guiding GPT to make selections based on a comprehensive understanding of video and reference information. Furthermore, we propose a Language-Binded Reference Unification (LBRU) module to convert audio signals into language-formatted references, thereby unifying the formats of AVS and RVOS tasks in the same pipeline. Extensive experiments on both tasks show that our training-free AL-Ref-SAM 2 pipeline achieves performances comparable to or even better than fully-supervised fine-tuning methods. The code is available at: https://github.com/appletea233/AL-Ref-SAM2.

本文提出了一种基于音频-语言参考的SAM 2（AL-Ref-SAM 2）管道，用于探索音频和语言参考的视频对象分割的无训练范式，即AVS和RVOS任务。该直观解决方案利用GroundingDINO从单帧中识别目标对象，并使用SAM 2在整个视频中分割已识别的对象。由于缺乏对视频上下文探索，该方案对时空变化不太稳健。因此，在我们的AL-Ref-SAM 2管道中，我们提出了一种新颖的GPT辅助支点选择（GPT-PS）模块，以指导GPT-4执行两步时空推理，从而按顺序选择支点帧和支点框，为SAM 2提供高质量的初始对象提示。在GPT-PS中，设计了两个针对任务的思维链提示，通过引导GPT基于视频和参考信息的全面理解进行选择，以激发GPT的时空推理能力。此外，我们提出了一种语言绑定参考统一（LBRU）模块，将音频信号转换为语言格式的参考，从而统一管道中AVS和RVOS任务格式。在这两项任务上的大量实验表明，我们的无训练AL-Ref-SAM 2管道的性能可与完全监督的微调方法相媲美甚至更好。代码可通过以下网址获得：https://github.com/appletea233/AL-Ref-SAM2。

论文及项目相关链接

PDF Accepted by AAAI 2025

Summary

本文提出了一种基于音频-语言参考的SAM 2（AL-Ref-SAM 2）管道，用于探索无需训练即可进行音频和视频参考的对象分割任务。该方法利用GroundingDINO识别目标对象，使用SAM 2在视频中对识别对象进行分割。为了克服由于缺乏视频上下文探索导致的时空变化鲁棒性不足的问题，引入了GPT辅助的枢轴选择（GPT-PS）模块，指导GPT-4进行两步时空推理，选择枢轴帧和枢轴框，为SAM 2提供高质量的对象初始提示。同时，提出了语言绑定参考统一（LBRU）模块，将音频信号转换为语言格式的参考，统一AVS和RVOS任务在同一管道中的格式。实验表明，该训练免费的AL-Ref-SAM 2管道的性能可与或优于完全监督的微调方法。

Key Takeaways

提出了基于音频-语言参考的SAM 2管道（AL-Ref-SAM 2），用于无需训练的音频和视频参考对象分割任务。
利用GroundingDINO识别目标对象，通过SAM 2进行视频分割。
引入GPT辅助的枢轴选择（GPT-PS）模块以提高视频分割的鲁棒性。
GPT-PS模块通过指导GPT-4进行两步时空推理来选择枢轴帧和枢轴框。
提出语言绑定参考统一（LBRU）模块，将音频信号转换为语言格式参考，统一不同任务的格式。
在两种任务上进行的广泛实验表明，AL-Ref-SAM 2管道性能优越。

Cool Papers

点此查看论文截图

SpikingSSMs: Learning Long Sequences with Sparse and Parallel Spiking State Space Models

Authors:Shuaijie Shen, Chao Wang, Renzhuo Huang, Yan Zhong, Qinghai Guo, Zhichao Lu, Jianguo Zhang, Luziwei Leng

Known as low energy consumption networks, spiking neural networks (SNNs) have gained a lot of attention within the past decades. While SNNs are increasing competitive with artificial neural networks (ANNs) for vision tasks, they are rarely used for long sequence tasks, despite their intrinsic temporal dynamics. In this work, we develop spiking state space models (SpikingSSMs) for long sequence learning by leveraging on the sequence learning abilities of state space models (SSMs). Inspired by dendritic neuron structure, we hierarchically integrate neuronal dynamics with the original SSM block, meanwhile realizing sparse synaptic computation. Furthermore, to solve the conflict of event-driven neuronal dynamics with parallel computing, we propose a light-weight surrogate dynamic network which accurately predicts the after-reset membrane potential and compatible to learnable thresholds, enabling orders of acceleration in training speed compared with conventional iterative methods. On the long range arena benchmark task, SpikingSSM achieves competitive performance to state-of-the-art SSMs meanwhile realizing on average 90% of network sparsity. On language modeling, our network significantly surpasses existing spiking large language models (spikingLLMs) on the WikiText-103 dataset with only a third of the model size, demonstrating its potential as backbone architecture for low computation cost LLMs.

被称为低功耗神经网络，脉冲神经网络（SNNs）在过去的几十年中引起了广泛关注。虽然SNNs在视觉任务方面与人工神经网络（ANNs）的竞争日益激烈，但它们很少用于长序列任务，尽管它们具有内在的时空动态特性。在这项工作中，我们通过利用状态空间模型（SSMs）的序列学习能力，开发用于长序列学习的脉冲状态空间模型（SpikingSSMs）。受树突神经元结构的启发，我们分层地将神经元动态与原始SSM块集成，同时实现稀疏突触计算。此外，为了解决事件驱动神经元动力学与并行计算之间的冲突，我们提出了一个轻量级的替代动态网络，该网络能够准确预测重置后的膜电位并且与可学习的阈值兼容，与常规迭代方法相比，训练速度提高了几个数量级。在远程区域基准任务上，SpikingSSM实现了与最新SSM的竞争性能，同时实现了平均90%的网络稀疏性。在语言建模方面，我们的网络在WikiText-103数据集上的表现显著超越了现有的脉冲大型语言模型（spikingLLMs），并且只有三分之一模型大小，显示出其作为低计算成本LLM的骨干架构的潜力。

论文及项目相关链接

PDF

Summary

本文介绍了尖峰状态空间模型（SpikingSSMs）在长序列学习方面的应用。该模型结合了状态空间模型（SSMs）的序列学习能力，并利用树突神经元结构的启发，实现了神经元动力学的层次集成和稀疏突触计算。为解决事件驱动神经元动力学与并行计算的冲突，提出了轻量级替代动态网络，可准确预测重置后的膜电位并与学习阈值兼容，使得训练速度比传统迭代方法有所加快。在长范围区域基准测试中，SpikingSSM实现了与最新SSM技术的竞争性能，同时实现了平均90%的网络稀疏性。在语言建模方面，该网络在WikiText-103数据集上的表现显著优于现有的尖峰大型语言模型（spikingLLMs），仅使用了三分之一大小的模型，显示出其作为低计算成本大型语言模型（LLMs）的潜在架构价值。

Key Takeaways

Spiking state space models (SpikingSSMs)结合了状态空间模型（SSMs）的序列学习能力，用于长序列学习。
层级结合神经元动力学和原始SSM模块，受到树突神经元结构的启发。
提出轻量级替代动态网络，解决事件驱动神经元动力学与并行计算的冲突。
SpikingSSM在长范围区域基准测试中实现了与最新SSM技术的竞争性能，并实现了网络稀疏性。
在语言建模方面，该网络在WikiText-103数据集上的表现优于现有的尖峰大型语言模型（spikingLLMs）。
该网络架构具有潜力成为低计算成本的大型语言模型（LLMs）的骨干架构。

Cool Papers

点此查看论文截图

Security Attacks on LLM-based Code Completion Tools

Authors:Wen Cheng, Ke Sun, Xinyu Zhang, Wei Wang

The rapid development of large language models (LLMs) has significantly advanced code completion capabilities, giving rise to a new generation of LLM-based Code Completion Tools (LCCTs). Unlike general-purpose LLMs, these tools possess unique workflows, integrating multiple information sources as input and prioritizing code suggestions over natural language interaction, which introduces distinct security challenges. Additionally, LCCTs often rely on proprietary code datasets for training, raising concerns about the potential exposure of sensitive data. This paper exploits these distinct characteristics of LCCTs to develop targeted attack methodologies on two critical security risks: jailbreaking and training data extraction attacks. Our experimental results expose significant vulnerabilities within LCCTs, including a 99.4% success rate in jailbreaking attacks on GitHub Copilot and a 46.3% success rate on Amazon Q. Furthermore, We successfully extracted sensitive user data from GitHub Copilot, including 54 real email addresses and 314 physical addresses associated with GitHub usernames. Our study also demonstrates that these code-based attack methods are effective against general-purpose LLMs, such as the GPT series, highlighting a broader security misalignment in the handling of code by modern LLMs. These findings underscore critical security challenges associated with LCCTs and suggest essential directions for strengthening their security frameworks. The example code and attack samples from our research are provided at https://github.com/Sensente/Security-Attacks-on-LCCTs.

大型语言模型（LLM）的快速发展极大地提升了代码补全功能，并催生了一代基于LLM的代码补全工具（LCCTs）。与一般用途的LLM不同，这些工具拥有独特的工作流程，它们整合多种信息源作为输入，并优先于自然语言交互给出代码建议，这为代码补全工具带来独特的挑战。此外，LCCT通常依赖专有代码数据集进行训练，引发了关于潜在敏感数据泄露的担忧。本文利用LCCT的这些独特特点，针对两种关键安全风险制定了有针对性的攻击方法：越狱攻击和训练数据提取攻击。我们的实验结果揭示了LCCTs中存在的重大漏洞，包括对GitHub Copilot的越狱攻击成功率为99.4%，对亚马逊Q的成功率为46.3%。此外，我们还成功地从GitHub Copilot中提取了用户的敏感数据，包括54个真实电子邮件地址和与GitHub用户名相关的314个实体地址。我们的研究还表明，这些基于代码的攻击方法对通用LLM（如GPT系列）同样有效，这凸显了现代LLM在处理代码方面的安全配置不当问题。这些发现强调了LCCT面临的关键安全挑战，并为加强其安全框架提供了重要方向。我们的研究中的示例代码和攻击样本可在https://github.com/Sensente/Security-Attacks-on-LCCTs获得。

论文及项目相关链接

PDF Paper accepted at AAAI 2025

Summary

大型语言模型（LLM）的快速发展推动了代码补全能力的显著增强，出现了新一代基于LLM的代码补全工具（LCCTs）。这些工具具有独特的工作流程，整合多种信息源作为输入，优先提供代码建议而非自然语言交互，从而带来了新的安全挑战。此外，LCCTs通常依赖专有代码数据集进行训练，引发了关于潜在敏感数据泄露的担忧。本文利用LCCTs的独特特点，针对两项关键安全风险设计了有针对性的攻击方法：越狱攻击和训练数据提取攻击。实验结果显示，LCCTs存在重大漏洞，包括对GitHub Copilot的越狱攻击成功率高达99.4%，对亚马逊Q的成功率为46.3%。此外，我们还成功从GitHub Copilot提取了用户的敏感数据，包括54个真实电子邮件地址和314个与GitHub用户名关联的物理地址。我们的研究还表明，这些基于代码的攻击方法对通用LLM（如GPT系列）同样有效，突显了现代LLM在处理代码时的广泛安全偏差。本研究的示例代码和攻击样本可在链接找到。

Key Takeaways

LLM的快速发展推动了代码补全工具（LCCTs）的进步。
LCCTs具有独特的工作流程和安全挑战，涉及多个信息源的整合和代码建议的优先提供。
LCCTs依赖专有代码数据集进行训练，引发敏感数据泄露的担忧。
针对LCCTs存在越狱攻击和训练数据提取攻击的重大漏洞。
实验显示GitHub Copilot等LCCTs工具存在高风险漏洞，攻击者可成功提取用户敏感数据。
这些攻击方法对通用LLM同样有效，显示现代LLM在处理代码时的安全缺陷。

Cool Papers

点此查看论文截图

Advancing Mental Health Pre-Screening: A New Custom GPT for Psychological Distress Assessment

Authors:Jinwen Tang, Yi Shang

This study introduces ‘Psycho Analyst’, a custom GPT model based on OpenAI’s GPT-4, optimized for pre-screening mental health disorders. Enhanced with DSM-5, PHQ-8, detailed data descriptions, and extensive training data, the model adeptly decodes nuanced linguistic indicators of mental health disorders. It utilizes a dual-task framework that includes binary classification and a three-stage PHQ-8 score computation involving initial assessment, detailed breakdown, and independent assessment, showcasing refined analytic capabilities. Validation with the DAIC-WOZ dataset reveals F1 and Macro-F1 scores of 0.929 and 0.949, respectively, along with the lowest MAE and RMSE of 2.89 and 3.69 in PHQ-8 scoring. These results highlight the model’s precision and transformative potential in enhancing public mental health support, improving accessibility, cost-effectiveness, and serving as a second opinion for professionals.

本研究介绍了“心理分析专家”，这是一个基于OpenAI的GPT-4的自定义GPT模型，经过优化，用于预先筛选精神健康障碍。通过DSM-5、PHQ-8、详细数据描述和大量的训练数据增强，该模型能够熟练地解码精神健康障碍的微妙语言指标。它采用双重任务框架，包括二元分类和涉及初步评估、详细分解和独立评估的三阶段PHQ-8评分计算，展示了精细的分析能力。使用DAIC-WOZ数据集进行验证，结果显示F1和宏观F1得分分别为0.929和0.949，在PHQ-8评分中平均绝对误差和均方根误差最低，分别为2.89和3.69。这些结果突出了该模型的精确度和在增强公众精神健康支持、提高可及性、成本效益以及作为专业人士的第二意见方面的变革潜力。

论文及项目相关链接

PDF Accepted by IEEE CogMI 2024

Summary

一项研究推出了名为“精神分析专家”的GPT模型，该模型基于OpenAI的GPT-4构建，专门用于预先筛选精神健康疾病。该模型结合了DSM-5、PHQ-8等标准工具，通过详细的数据描述和大量的训练数据，能够精准解读语言中微妙的心理健康障碍迹象。采用双重任务框架，包括二元分类和涉及初步评估、详细分解和独立评估的三阶段PHQ-8评分计算，展现出精细的分析能力。使用DAIC-WOZ数据集验证，模型的F1和宏观F1得分分别为0.929和0.949，在PHQ-8评分方面的平均绝对误差和均方根误差最低，分别为2.89和3.69。这一成果展现了模型在公众精神健康支持方面的精准度和革新潜力，提高了可及性和成本效益，可成为专业人士的第二意见参考。

Key Takeaways

“Psycho Analyst”是基于GPT-4的定制模型，专为预筛精神健康疾病设计。
模型结合了DSM-5、PHQ-8等标准工具，具备解读语言中的精神健康障碍迹象的能力。
采用双重任务框架，包括二元分类和PHQ-8评分计算的三阶段过程。
模型展现出色的性能，经DAIC-WOZ数据集验证，F1和宏观F1得分高。
模型在PHQ-8评分方面的平均绝对误差和均方根误差达到较低水平。
模型在公众精神健康支持方面具备精准度和革新潜力。

Cool Papers

点此查看论文截图

DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving

Authors:Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, Junxian He

Solving mathematical problems requires advanced reasoning abilities and presents notable challenges for large language models. Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries. Hypothesizing that difficult queries are crucial to learn complex reasoning, we propose Difficulty-Aware Rejection Tuning (DART), a method that allocates difficult queries more trials during the synthesis phase, enabling more extensive training on difficult samples. Utilizing DART, we have created new datasets for mathematical problem-solving that focus more on difficult queries and are substantially smaller than previous ones. Remarkably, our synthesis process solely relies on a 7B-sized open-weight model, without reliance on the commonly used proprietary GPT-4. We fine-tune various base models on our datasets ranging from 7B to 70B in size, resulting in a series of strong models called DART-MATH. In comprehensive in-domain and out-of-domain evaluation on 6 mathematical benchmarks, DART-MATH outperforms vanilla rejection tuning significantly, being superior or comparable to previous arts, despite using much smaller datasets and no proprietary models. Furthermore, our results position our synthetic datasets as the most effective and cost-efficient publicly available resources for advancing mathematical problem-solving.

解决数学问题需要高级推理能力，对于大型语言模型来说，这构成了显著的挑战。以前的工作通常从专有模型中合成数据来增强现有数据集，然后通过指令微调来获得顶级结果。然而，我们对这些数据集的分析表明，它们存在严重的偏向简单查询的偏见，对于最具挑战性的查询，经常无法生成任何正确的回应。我们假设困难查询对于学习复杂推理至关重要，因此提出了困难感知拒绝调整（DART）的方法，该方法在合成阶段为困难查询分配更多的试验，使它们在困难样本上进行更广泛的训练。利用DART，我们已经创建了专注于困难查询的数学问题求解新数据集，并且这些数据集比以前的要小得多。值得注意的是，我们的合成过程仅依赖于一个7B大小的开放权重模型，不需要常用的专有GPT-4。我们在自己的数据集上微调了从7B到70B的各种基础模型，从而产生了一系列强大的模型，称为DART-MATH。在6个数学基准的全面领域内部和领域外部评估中，DART-MATH显著优于普通的拒绝调整，并且在之前的技术中表现优越或相当，尽管它使用的数据集更小，而且没有使用专有模型。此外，我们的结果将我们的合成数据集定位为推进数学问题求解方面最有效且最经济的公开资源。

论文及项目相关链接

PDF NeurIPS 2024. Data and model checkpoints are available at https://github.com/hkust-nlp/dart-math

Summary：解决数学问题需要高级推理能力，对大语言模型提出了挑战。过去的研究通常通过合成数据和微调指令来增强现有数据集以实现顶尖结果。然而，分析发现这些数据集存在严重偏向简单查询的问题，对最复杂的查询往往无法给出正确回应。因此，我们提出难度感知拒绝调整（DART）方法，在合成阶段为困难查询分配更多试验次数，以便对困难样本进行更广泛的训练。使用DART，我们创建了专注于困难查询的数学问题求解新数据集，且比之前的数据集小得多。合成过程仅依赖于7B大小的开放权重模型，未使用常见的专有GPT-4。我们在不同大小（从7B到70B）的基准模型上进行微调，形成一系列强大的DART-MATH模型。在6个数学基准测试上进行全面的域内和域外评估，DART-MATH显著优于简单的拒绝调整，且优于或相当于以前的技术，尽管使用了较小的数据集且没有专有模型。此外，我们的结果将我们的合成数据集定位为推进数学问题求解方面最有效且最经济的公开资源。

Key Takeaways：

解决数学问题需要高级推理能力，对大语言模型提出了挑战。
现有数据集存在偏向简单查询的问题，难以应对复杂查询的挑战。
提出难度感知拒绝调整（DART）方法，为困难查询分配更多训练机会。
使用DART创建的新数据集专注于困难查询，且规模较小。
合成过程未依赖专有GPT-4模型，仅使用7B开放权重模型。
DART-MATH模型在多个数学基准测试中表现优异，优于简单拒绝调整和先前技术。

Cool Papers

点此查看论文截图

BEADs: Bias Evaluation Across Domains

Authors:Shaina Raza, Mizanur Rahman, Michael R. Zhang

Recent advancements in large language models (LLMs) have greatly enhanced natural language processing (NLP) applications. Nevertheless, these models often inherit biases from their training data. Despite the availability of various datasets for bias detection, most are limited to one or two NLP tasks (typically classification or evaluation) and lack comprehensive evaluations across a broader range of NLP tasks. To address this gap, we introduce the Bias Evaluations Across Domains BEADs dataset, designed to support a wide array of NLP tasks, including text classification, token classification, bias quantification, and benign language generation. A key focus of this paper is the gold label dataset that is annotated by GPT4 for scalabilty and verified by experts to ensure high reliability. BEADs provides data for both fine-tuning, including classification and language generation tasks, and for evaluating LLMs. Our findings indicate that BEADs effectively identifies numerous biases when fine-tuned on this dataset. It also reduces biases when used for fine-tuning language generation task, while preserving language quality. The results also reveal some prevalent demographic biases in LLMs when BEADs is used for evaluation in demographic task. We provide the BEADs dataset for detecting biases in various domains, and this dataset is readily usable for responsible AI development and application. The dataset can be accessed at https://huggingface.co/datasets/shainar/BEAD .

近期大型语言模型（LLM）的进步极大地增强了自然语言处理（NLP）应用的能力。然而，这些模型往往从其训练数据中继承了偏见。尽管存在各种用于检测偏见的的数据集，但大多数仅限于一个或两个NLP任务（通常为分类或评估），并且在更广泛的NLP任务范围内缺乏综合评估。为了弥补这一空白，我们引入了跨域偏见评估BEADs数据集，旨在支持广泛的NLP任务，包括文本分类、令牌分类、偏见量化和良性语言生成。本文的一个重点是使用GPT4进行标注的金标签数据集，以实现可扩展性，并通过专家验证以确保高可靠性。BEADs既可用于微调（包括分类和语言生成任务），也可用于评估LLM。我们的研究结果表明，在BEADs数据集上微调时，它能有效地识别出大量偏见。在用于微调语言生成任务时，它还能减少偏见，同时保持语言质量。结果还显示，在使用BEADs进行人口统计任务评估时，LLM中存在一些普遍的人口统计偏见。我们提供BEADs数据集用于检测各种领域的偏见，此数据集可立即用于负责任的AI开发和应用。该数据集可通过https://huggingface.co/datasets/shainar/BEAD访问。

论文及项目相关链接

PDF under review

Summary
大型语言模型（LLM）的最新进展极大地促进了自然语言处理（NLP）应用的发展，但这些模型往往继承了训练数据中的偏见。为了解决现有数据集在偏见检测方面的局限性，我们推出了跨域偏见评估数据集（BEADs），支持多种NLP任务，包括文本分类、令牌分类、偏见量化和良性语言生成。该数据集通过GPT4进行标注，以提高可扩展性，并由专家验证以确保高可靠性。研究表明，BEADs在微调过程中能有效识别多种偏见，在语言生成任务中用于微调时，能减少偏见并保持语言质量。然而，在用于评估人口统计任务时，揭示了LLM中一些普遍的人口统计偏见。我们提供BEADs数据集用于检测不同领域的偏见，便于用于开发和应用负责任的人工智能。

Key Takeaways

LLM的最新进展促进了NLP应用的发展，但存在继承偏见的问题。
现有偏见检测数据集通常仅限于一个或两个NLP任务，缺乏全面的评估。
介绍了BEADs数据集，支持多种NLP任务，包括文本分类、令牌分类、偏见量化和良性语言生成。
BEADs数据集通过GPT4进行标注，并由专家验证以确保可靠性。
BEADs能有效识别多种偏见，并在微调过程中减少偏见。
在语言生成任务中，BEADs用于微调时能保持语言质量。

Cool Papers

点此查看论文截图

VIoTGPT: Learning to Schedule Vision Tools in LLMs towards Intelligent Video Internet of Things

Authors:Yaoyao Zhong, Mengshi Qi, Rui Wang, Yuhan Qiu, Yang Zhang, Huadong Ma

Video Internet of Things (VIoT) has shown full potential in collecting an unprecedented volume of video data. How to schedule the domain-specific perceiving models and analyze the collected videos uniformly, efficiently, and especially intelligently to accomplish complicated tasks is challenging. To address the challenge, we build VIoTGPT, the framework based on LLMs to correctly interact with humans, query knowledge videos, and invoke vision models to analyze multimedia data collaboratively. To support VIoTGPT and related future works, we meticulously crafted the VIoT-Tool dataset, including the training dataset and the benchmark involving 11 representative vision models across three categories based on semi-automatic annotations. To guide LLM to act as the intelligent agent towards intelligent VIoT, we resort to the ReAct instruction tuning method based on VIoT-Tool to learn the tool capability. Quantitative and qualitative experiments and analyses demonstrate the effectiveness of VIoTGPT. We believe VIoTGPT contributes to improving human-centered experiences in VIoT applications. The project website is https://github.com/zhongyy/VIoTGPT.

物联网视频（VIoT）已经显示出收集前所未有的大量视频数据的巨大潜力。如何为特定领域感知模型安排日程，并统一、高效、尤其是智能地分析所收集的视频，以完成复杂任务是一个挑战。为了应对这一挑战，我们构建了基于大语言模型（LLM）的VIoTGPT框架，以便正确与人类交互、查询知识视频，并调用视觉模型协同分析多媒体数据。为了支持VIoTGPT和相关的未来工作，我们精心制作了VIoT-Tool数据集，包括训练数据集和基准测试集，涉及基于半自动注释的三个类别中的11个代表性视觉模型。为了引导LLM作为智能物联网的智能代理，我们借助基于VIoT-Tool的ReAct指令调整方法学习工具能力。定量和定性的实验与分析证明了VIoTGPT的有效性。我们相信VIoTGPT有助于提高以人类为中心的物联网视频应用体验。项目网站是https://github.com/zhongyy/VIoTGPT。

论文及项目相关链接

PDF AAAI 2025, 12 pages

Summary

视频物联网（VIoT）收集了大量前所未有的视频数据，如何调度特定领域的感知模型并统一分析这些数据，以完成复杂任务是一大挑战。为解决此问题，我们建立了基于LLM的VIoTGPT框架，可正确与人类互动，查询视频知识并调用视觉模型以协同分析多媒体数据。为支持VIoTGPT及相关未来工作，我们精心构建了VIoT-Tool数据集，包含训练数据集和涉及三个类别中11种代表性视觉模型的基准测试，基于半自动注释。为引导LLM成为面向智能VIoT的智能代理，我们借助VIoT-Tool采用ReAct指令调整方法学习工具能力。定量和定性实验与分析证明了VIoTGPT的有效性。我们相信VIoTGPT有助于提高以人类为中心在VIoT应用中的体验。

Key Takeaways