Published: 2025-09-16

Updated: 2025-10-07

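⚠️ All summaries below are generated by a large language model and may contain errors. They are for reference only; use them with caution.
🔴 Note: do not use them in serious academic settings. They are intended only for a first-pass screening before reading the papers.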

Updated 2025-09-16

Prototypical Contrastive Learning For Improved Few-Shot Audio Classification

Authors: Christos Sgouropoulos, Christos Nikou, Stefanos Vlachos, Vasileios Theiou, Christos Foukanelis, Theodoros Giannakopoulos

Few-shot learning has emerged as a powerful paradigm for training models with limited labeled data, addressing challenges in scenarios where large-scale annotation is impractical. While extensive research has been conducted in the image domain, few-shot learning in audio classification remains relatively underexplored. In this work, we investigate the effect of integrating supervised contrastive loss into prototypical few shot training for audio classification. In detail, we demonstrate that angular loss further improves the performance compared to the standard contrastive loss. Our method leverages SpecAugment followed by a self-attention mechanism to encapsulate diverse information of augmented input versions into one unified embedding. We evaluate our approach on MetaAudio, a benchmark including five datasets with predefined splits, standardized preprocessing, and a comprehensive set of few-shot learning models for comparison. The proposed approach achieves state-of-the-art performance in a 5-way, 5-shot setting.

Paper and project links

PDF Accepted and presented at the IEEE International Workshop on Machine Learning for Signal Processing, Aug. 31–Sep. 3, 2025, Istanbul, Turkey, 6 pages, 2 figures, 1 table

Summary

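This paper studies the effect of integrating a supervised contrastive loss into prototypical few-shot training for audio classification and finds that an angular loss improves performance over the standard contrastive loss. The method applies SpecAugment and then uses a self-attention mechanism to fold the diverse information from the augmented versions of an input into one unified embedding. Evaluated on the MetaAudio benchmark, it achieves state-of-the-art performance in the 5-way, 5-shot setting.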

Key Takeaways

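The paper studies combining a supervised contrastive loss with prototypical few-shot learning for audio classification.
Experiments show that an angular loss improves performance over the standard contrastive loss.
SpecAugment is used for data augmentation, followed by a self-attention mechanism that aggregates the augmented versions of each input.
The information from the augmented versions is folded into one unified embedding.
The method is evaluated on MetaAudio, a benchmark of five datasets with predefined splits, standardized preprocessing, and a set of few-shot learning models for comparison.
The proposed method achieves state-of-the-art performance in the 5-way, 5-shot setting.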
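
To make the two loss terms in this entry concrete, here is a minimal pure-Python sketch of a prototypical episode loss combined with a supervised contrastive loss. It is not the authors' implementation: it uses the standard supervised contrastive loss rather than the angular variant the paper reports as better, it omits SpecAugment and the self-attention step, and the toy embeddings, the `temperature` value, and the `contrastive_weight` trade-off are all assumptions made for illustration.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def prototypical_loss(support, support_labels, query, query_labels):
    """Cross-entropy over negative squared distances to class prototypes."""
    classes = sorted(set(support_labels))
    prototypes = [
        mean_vector([s for s, y in zip(support, support_labels) if y == c])
        for c in classes
    ]
    total = 0.0
    for q, y in zip(query, query_labels):
        logits = [-squared_distance(q, p) for p in prototypes]
        total -= logits[classes.index(y)] - log_sum_exp(logits)
    return total / len(query)


def unit(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Standard supervised contrastive loss on L2-normalized embeddings."""
    z = [unit(e) for e in embeddings]
    total, anchors = 0.0, 0
    for i, zi in enumerate(z):
        others = [j for j in range(len(z)) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a same-class partner contributes nothing
        sims = {j: sum(a * b for a, b in zip(zi, z[j])) / temperature for j in others}
        log_denominator = log_sum_exp(list(sims.values()))
        total -= sum(sims[j] - log_denominator for j in positives) / len(positives)
        anchors += 1
    return total / max(anchors, 1)


# Toy 2-way, 2-shot episode with 2-D embeddings (hypothetical values).
support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support_labels = [0, 0, 1, 1]
query = [[0.95, 0.05], [0.05, 0.95]]
query_labels = [0, 1]

contrastive_weight = 0.5  # hypothetical trade-off, not a value from the paper
loss = prototypical_loss(support, support_labels, query, query_labels) \
    + contrastive_weight * supervised_contrastive_loss(support, support_labels)
print(f"combined loss: {loss:.4f}")
```

In this toy episode the prototypical term rewards queries that sit close to their own class mean, while the contrastive term separately pulls same-class support embeddings together on the unit sphere.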

Multi-Intent Recognition in Dialogue Understanding: A Comparison Between Smaller Open-Source LLMs

Authors: Adnan Ahmad, Philine Kowol, Stefan Hillmann, Sebastian Möller

In this paper, we provide an extensive analysis of multi-label intent classification using Large Language Models (LLMs) that are open-source, publicly available, and can be run on consumer hardware. We use the MultiWOZ 2.1 dataset, a benchmark in the dialogue system domain, to investigate the efficacy of three popular open-source pre-trained LLMs, namely LLama2-7B-hf, Mistral-7B-v0.1, and Yi-6B. We perform the classification task in a few-shot setup, giving 20 examples in the prompt with some instructions. Our approach focuses on the differences in performance of these models across several performance metrics by methodically assessing these models on multi-label intent classification tasks. Additionally, we compare the performance of the instruction-based fine-tuning approach with supervised learning using the smaller transformer model BertForSequenceClassification as a baseline. To evaluate the performance of the models, we use evaluation metrics like accuracy, precision, and recall as well as micro, macro, and weighted F1 score. We also report the inference time, VRAM requirements, etc. Mistral-7B-v0.1 outperforms the two other generative models on 11 intent classes out of 14 in terms of F-Score, with a weighted average of 0.50. It also has relatively lower Hamming Loss and higher Jaccard Similarity, making it the winning model in the few-shot setting. We find that the BERT-based supervised classifier outperforms the best-performing few-shot generative LLM. The study provides a framework for small open-source LLMs in detecting complex multi-intent dialogues, enhancing the Natural Language Understanding aspect of task-oriented chatbots.

Paper and project links

PDF

Summary
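This paper presents an extensive analysis of multi-label intent classification with large language models (LLMs). Using the MultiWOZ 2.1 dataset, it compares three popular open-source pre-trained LLMs: LLama2-7B-hf, Mistral-7B-v0.1, and Yi-6B. Classification is performed in a few-shot setup, and the models are compared with a supervised BertForSequenceClassification baseline. Among the generative models, Mistral-7B-v0.1 performs best, with a weighted average F-score of 0.50, but the BERT-based supervised classifier still outperforms it.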

Key Takeaways

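The study analyzes multi-label intent classification with three open-source pre-trained LLMs on the MultiWOZ 2.1 dataset.
The LLMs are evaluated in a few-shot setup, with 20 examples and some instructions in the prompt.
Mistral-7B-v0.1 outperforms the other two generative models on 11 of the 14 intent classes in terms of F-score.
Mistral-7B-v0.1 reaches a weighted average F-score of 0.50, with relatively lower Hamming loss and higher Jaccard similarity.
The BERT-based supervised classifier outperforms the best-performing few-shot generative LLM.
The study provides a framework for using small open-source LLMs to detect complex multi-intent dialogues, strengthening the natural language understanding of task-oriented chatbots.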
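
The multi-label metrics named in this entry can be written out in a few lines, which may help when reading the reported numbers. The sketch below assumes predictions and references are binary indicator rows (one column per intent) and uses a sample-averaged Jaccard similarity; the paper's exact averaging choices are not stated in the abstract, and the tiny three-intent example is invented for illustration.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    wrong = sum(
        t != p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    return wrong / (len(y_true) * len(y_true[0]))


def jaccard_similarity(y_true, y_pred):
    """Mean over samples of |true AND pred| / |true OR pred|."""
    scores = []
    for row_t, row_p in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(row_t, row_p) if t and p)
        union = sum(1 for t, p in zip(row_t, row_p) if t or p)
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


def _f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_scores(y_true, y_pred):
    """Return micro, macro, and support-weighted F1 over label columns."""
    n_labels = len(y_true[0])
    tp, fp, fn = [0] * n_labels, [0] * n_labels, [0] * n_labels
    for row_t, row_p in zip(y_true, y_pred):
        for k, (t, p) in enumerate(zip(row_t, row_p)):
            tp[k] += int(t and p)
            fp[k] += int(p and not t)
            fn[k] += int(t and not p)
    per_label = [_f1(tp[k], fp[k], fn[k]) for k in range(n_labels)]
    support = [tp[k] + fn[k] for k in range(n_labels)]
    return {
        "micro": _f1(sum(tp), sum(fp), sum(fn)),
        "macro": sum(per_label) / n_labels,
        "weighted": sum(f * s for f, s in zip(per_label, support)) / sum(support),
    }


# Hypothetical example: 3 dialogue turns, 3 intent classes.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]

print("Hamming loss:", hamming_loss(y_true, y_pred))
print("Jaccard similarity:", jaccard_similarity(y_true, y_pred))
print("F1:", f1_scores(y_true, y_pred))
```

Lower Hamming loss and higher Jaccard similarity both indicate that the predicted intent sets overlap more closely with the reference sets, which is the sense in which the abstract uses them.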

Automated Tuning for Diffusion Inverse Problem Solvers without Generative Prior Retraining

Authors: Yaşar Utku Alçalar, Junno Yun, Mehmet Akçakaya

Diffusion/score-based models have recently emerged as powerful generative priors for solving inverse problems, including accelerated MRI reconstruction. While their flexibility allows decoupling the measurement model from the learned prior, their performance heavily depends on carefully tuned data fidelity weights, especially under fast sampling schedules with few denoising steps. Existing approaches often rely on heuristics or fixed weights, which fail to generalize across varying measurement conditions and irregular timestep schedules. In this work, we propose Zero-shot Adaptive Diffusion Sampling (ZADS), a test-time optimization method that adaptively tunes fidelity weights across arbitrary noise schedules without requiring retraining of the diffusion prior. ZADS treats the denoising process as a fixed unrolled sampler and optimizes fidelity weights in a self-supervised manner using only undersampled measurements. Experiments on the fastMRI knee dataset demonstrate that ZADS consistently outperforms both traditional compressed sensing and recent diffusion-based methods, showcasing its ability to deliver high-fidelity reconstructions across varying noise schedules and acquisition settings.

Paper and project links

PDF IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2025

Summary

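Diffusion/score-based models serve as powerful generative priors for inverse problems such as accelerated MRI reconstruction. Their flexibility decouples the measurement model from the learned prior, but their performance depends heavily on carefully tuned data fidelity weights, especially under fast sampling schedules with few denoising steps. Existing approaches often rely on heuristics or fixed weights, which do not generalize across measurement conditions and irregular timestep schedules. This work proposes Zero-shot Adaptive Diffusion Sampling (ZADS), a test-time optimization method that adaptively tunes the fidelity weights for arbitrary noise schedules without retraining the diffusion prior. ZADS treats the denoising process as a fixed unrolled sampler and optimizes the fidelity weights in a self-supervised manner using only the undersampled measurements. On the fastMRI knee dataset, ZADS consistently outperforms both traditional compressed sensing and recent diffusion-based methods across noise schedules and acquisition settings.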

Key Takeaways

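Diffusion/score-based models are powerful generative priors for inverse problems such as MRI reconstruction.
Their performance depends heavily on the data fidelity weights, especially under fast sampling schedules.
Existing methods rely on heuristics or fixed weights and adapt poorly to different measurement conditions and timestep schedules.
Zero-shot Adaptive Diffusion Sampling (ZADS) adaptively tunes the fidelity weights at test time.
ZADS treats the denoising process as a fixed unrolled sampler and optimizes the fidelity weights in a self-supervised manner.
On the fastMRI knee dataset, ZADS outperforms traditional compressed sensing and recent diffusion-based methods.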
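
The idea of tuning per-step fidelity weights from the measurements alone can be illustrated on a toy 1-D problem. The sketch below is not ZADS: the frozen "denoiser" is a moving average standing in for a diffusion prior, the self-supervision is implemented as a held-out split of the observed samples (an assumption, since the abstract does not describe the loss), and the weights are chosen by coordinate-wise grid search rather than by whatever optimizer the authors use. All names, the signal, and the grid are hypothetical.

```python
import math


def smooth(x):
    """Stand-in for a frozen denoiser: 3-tap moving average with edge clamping."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0 for i in range(n)]


def unrolled_sampler(y, observed, weights):
    """One denoise step plus one weighted data-consistency step per weight."""
    x = [0.0] * len(y)
    for lam in weights:
        x = smooth(x)
        # Gradient step on 0.5 * ||mask * x - y||^2, scaled by the fidelity weight.
        x = [xi - lam * (xi - yi) if i in observed else xi
             for i, (xi, yi) in enumerate(zip(x, y))]
    return x


def heldout_error(y, fit_idx, heldout_idx, weights):
    """Reconstruct from fit_idx only, then score on measurements never used."""
    x = unrolled_sampler(y, fit_idx, weights)
    return sum((x[i] - y[i]) ** 2 for i in heldout_idx) / len(heldout_idx)


def tune_weights(y, fit_idx, heldout_idx, steps=4,
                 grid=(0.0, 0.25, 0.5, 0.75, 1.0), sweeps=2):
    """Coordinate-wise grid search over per-step fidelity weights."""
    weights = [0.5] * steps
    for _ in range(sweeps):
        for t in range(steps):
            weights[t] = min(
                grid,
                key=lambda g: heldout_error(
                    y, fit_idx, heldout_idx, weights[:t] + [g] + weights[t + 1:]
                ),
            )
    return weights


# Toy undersampled measurement of a smooth signal (hypothetical data).
signal = [math.sin(0.3 * i) for i in range(32)]
observed = [i for i in range(32) if i % 3 != 1]
y = [signal[i] if i in set(observed) else 0.0 for i in range(32)]
heldout_idx = observed[::4]
fit_idx = set(observed) - set(heldout_idx)

weights = tune_weights(y, fit_idx, heldout_idx)
print("tuned fidelity weights:", weights)
print("held-out error:", heldout_error(y, fit_idx, heldout_idx, weights))
```

The prior (here `smooth`) is never modified; only the handful of fidelity weights change, which mirrors the abstract's point that no retraining of the generative prior is needed.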

MimicDroid: In-Context Learning for Humanoid Robot Manipulation from Human Play Videos

Authors: Rutav Shah, Shuijing Liu, Qi Wang, Zhenyu Jiang, Sateesh Kumar, Mingyo Seo, Roberto Martín-Martín, Yuke Zhu

We aim to enable humanoid robots to efficiently solve new manipulation tasks from a few video examples. In-context learning (ICL) is a promising framework for achieving this goal due to its test-time data efficiency and rapid adaptability. However, current ICL methods rely on labor-intensive teleoperated data for training, which restricts scalability. We propose using human play videos – continuous, unlabeled videos of people interacting freely with their environment – as a scalable and diverse training data source. We introduce MimicDroid, which enables humanoids to perform ICL using human play videos as the only training data. MimicDroid extracts trajectory pairs with similar manipulation behaviors and trains the policy to predict the actions of one trajectory conditioned on the other. Through this process, the model acquired ICL capabilities for adapting to novel objects and environments at test time. To bridge the embodiment gap, MimicDroid first retargets human wrist poses estimated from RGB videos to the humanoid, leveraging kinematic similarity. It also applies random patch masking during training to reduce overfitting to human-specific cues and improve robustness to visual differences. To evaluate few-shot learning for humanoids, we introduce an open-source simulation benchmark with increasing levels of generalization difficulty. MimicDroid outperformed state-of-the-art methods and achieved nearly twofold higher success rates in the real world. Additional materials can be found on: ut-austin-rpl.github.io/MimicDroid

Paper and project links

PDF 11 pages, 9 figures, 5 tables

Summary

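This paper aims to let humanoid robots solve new manipulation tasks efficiently from a few video examples. It proposes human play videos as a scalable and diverse source of training data. MimicDroid uses human play videos as its only training data: it extracts pairs of trajectories with similar manipulation behaviors and trains the policy to predict the actions of one trajectory conditioned on the other, which gives the model the ability to adapt to novel objects and environments at test time. To bridge the embodiment gap, MimicDroid retargets human wrist poses estimated from RGB videos to the humanoid by exploiting kinematic similarity. Random patch masking during training reduces overfitting to human-specific cues and improves robustness to visual differences. To evaluate few-shot learning for humanoids, the authors introduce an open-source simulation benchmark with increasing levels of generalization difficulty. MimicDroid outperforms state-of-the-art methods and achieves nearly twofold higher success rates in the real world.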
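
Random patch masking, mentioned above as the way to reduce overfitting to human-specific visual cues, can be sketched as follows. This is an illustrative version only: the patch size, mask ratio, and fill value are assumptions (the abstract gives none of them), and the image is a plain nested list rather than whatever tensor format the authors use.

```python
import random


def random_patch_mask(image, patch=4, mask_ratio=0.3, fill=0.0, rng=None):
    """Return a copy of a 2-D image with a random subset of patches overwritten."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    masked = [row[:] for row in image]  # leave the input untouched
    # Top-left corners of a non-overlapping patch grid.
    cells = [(r, c) for r in range(0, height, patch) for c in range(0, width, patch)]
    n_masked = int(round(mask_ratio * len(cells)))
    for r0, c0 in rng.sample(cells, n_masked):
        for r in range(r0, min(r0 + patch, height)):
            for c in range(c0, min(c0 + patch, width)):
                masked[r][c] = fill
    return masked


# Hypothetical 8x8 single-channel frame; half of its four 4x4 patches get masked.
frame = [[1.0] * 8 for _ in range(8)]
augmented = random_patch_mask(frame, patch=4, mask_ratio=0.5, rng=random.Random(0))
for row in augmented:
    print("".join("#" if v else "." for v in row))
```

Because a different set of patches is hidden on each call, the policy cannot rely on any single image region, such as the appearance of a human hand, to predict actions.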

Key Takeaways