⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ,还请您给我们一些鼓励!⭐️ HuggingFace免费体验
2025-11-16 更新
Reinforcing Trustworthiness in Multimodal Emotional Support Systems
Authors:Huy M. Le, Dat Tien Nguyen, Ngan T. T. Vo, Tuan D. Q. Nguyen, Nguyen Binh Le, Duy Minh Ho Nguyen, Daniel Sonntag, Lizi Liao, Binh T. Nguyen
In today's world, emotional support is increasingly essential, yet it remains challenging for both those seeking help and those offering it. Multimodal approaches to emotional support show great promise by integrating diverse data sources to provide empathetic, contextually relevant responses, fostering more effective interactions. However, current methods have notable limitations, often relying solely on text or converting other data types into text, or providing emotion recognition only, thus overlooking the full potential of multimodal inputs. Moreover, many studies prioritize response generation without accurately identifying critical emotional support elements or ensuring the reliability of outputs. To overcome these issues, we introduce MultiMood, a new framework that (i) leverages multimodal embeddings from video, audio, and text to predict emotional components and to produce responses aligned with professional therapeutic standards. To improve trustworthiness, we (ii) incorporate novel psychological criteria and apply Reinforcement Learning (RL) to optimize large language models (LLMs) for consistent adherence to these standards. We also (iii) analyze several advanced LLMs to assess their multimodal emotional support capabilities. Experimental results show that MultiMood achieves state-of-the-art performance on the MESC and DFEW datasets, while RL-driven trustworthiness improvements are validated through human and LLM evaluations, demonstrating its superior capability in applying a multimodal framework in this domain.
在如今的世界中,情感支持变得愈发重要,但对于寻求帮助和提供帮助的人来说仍然是一个挑战。情感支持的多模式方法显示出巨大的潜力,它通过整合各种数据源来提供富有同情心和语境相关的回应,促进更有效的互动。然而,当前的方法存在明显的局限性,通常仅依赖文本或将其他数据类型转换为文本,或者仅提供情绪识别,从而忽略了多模式输入的全面潜力。此外,许多研究将重点放在响应生成上,却未能准确识别关键的情感支持元素或确保输出的可靠性。为了克服这些问题,我们引入了新框架MultiMood,它(i)利用视频、音频和文本的多模态嵌入来预测情感成分,并产生与专业治疗标准相符的响应。为了提高可信度,我们(ii)融入新颖的心理标准并应用强化学习(RL)来优化大型语言模型(LLM),使其始终符合这些标准。我们还(iii)分析了几种先进的大型语言模型,以评估其在多模式情感支持方面的能力。实验结果表明,MultiMood在MESC和DFEW数据集上达到最新水平,通过人类和大型语言模型评估验证了强化学习驱动的可信性改进,证明了其在该领域应用多模式框架的卓越能力。
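下面给出一个极简的多模态融合草图(仅为示意,并非论文的原始架构):假设已分别提取视频、音频、文本三种模态的嵌入向量,通过拼接加线性层预测情感成分;其中各嵌入维度与情感类别数均为假设值。

```python
# 最小示意:拼接三种模态嵌入并预测情感成分(假设性草图,非论文原始架构)
import torch
import torch.nn as nn

class SimpleMultimodalFusion(nn.Module):
    def __init__(self, d_video=512, d_audio=256, d_text=768, n_emotions=7):
        super().__init__()
        self.proj = nn.Linear(d_video + d_audio + d_text, 256)
        self.head = nn.Linear(256, n_emotions)  # 情感成分预测头

    def forward(self, v, a, t):
        # v/a/t: [batch, d_*] 的模态嵌入
        fused = torch.cat([v, a, t], dim=-1)
        return self.head(torch.relu(self.proj(fused)))

model = SimpleMultimodalFusion()
logits = model(torch.randn(2, 512), torch.randn(2, 256), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 7])
```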
论文及项目相关链接
Summary:
当前情感支持的需求日益增长,但寻求帮助和提供帮助的人都面临挑战。多模式情感支持方法通过整合各种数据源来提供富有同情心和情境相关的回应,展现出巨大潜力。然而,现有方法存在诸多局限,如过于依赖文本或转换其他数据类型为文本,或仅提供情绪识别,忽略了多模式输入的全面潜力。为克服这些问题,我们推出MultiMood框架,利用视频、音频和文本的多元嵌入来预测情绪成分并生成与专业治疗标准对齐的回应。为提高可信度,我们融入新的心理标准并使用强化学习优化大型语言模型,确保其持续符合标准。实验结果显示,MultiMood在MESC和DFEW数据集上达到最新水平,同时RL驱动的信任度改进通过人类和LLM评估得到验证,证明其在该领域应用多模式框架的卓越能力。
Key Takeaways:
- 当前情感支持的重要性及其挑战。
- 多模式情感支持方法通过整合不同数据源展现出巨大潜力。
- 现有方法的局限在于过于依赖文本或仅提供情绪识别,忽略了多模式数据的全面潜力。
- MultiMood框架利用视频、音频和文本的多元嵌入来预测情绪成分。
- MultiMood框架生成与专业治疗标准对齐的回应。
- 通过融入新的心理标准和强化学习优化大型语言模型,提高MultiMood框架的可信度。
点此查看论文截图
HI-TransPA: Hearing Impairments Translation Personal Assistant
Authors:Zhiming Ma, Shiyu Gan, Junhao Zhao, Xianming Li, Qingyun Pan, Peidong Wang, Mingjun Pan, Yuhao Mo, Jiajie Cheng, Chengxin Chen, Zhonglun Cao, Chonghan Liu, Shi Cheng
To provide a unified and flexible solution for daily communication among hearing-impaired individuals, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with high-frame-rate lip dynamics, enabling both translation and dialogue within a single multimodal framework. To tackle the challenges of noisy and heterogeneous raw data and the limited adaptability of existing Omni-Models to hearing-impaired speech, we construct a comprehensive preprocessing and curation pipeline that detects facial landmarks, isolates and stabilizes the lip region, and quantitatively assesses multimodal sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. We further adopt a SigLIP encoder combined with a Unified 3D-Resampler to efficiently encode high-frame-rate lip motion. Experiments on our purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. This work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
为了给听障人士的日常沟通提供统一而灵活的解决方案,我们将Omni-Model范式引入辅助技术,并推出了HI-TransPA这一指令驱动的视听个人助理。该模型融合了模糊语音和高帧率唇部动态,能够在单一的多模式框架内实现翻译和对话。为了应对原始数据的嘈杂和异质性以及现有Omni-Models对听障语音的有限适应性,我们构建了一个全面的预处理和筛选管道,用于检测面部关键点、隔离和稳定唇部区域,并定量评估多模式样本质量。这些质量分数引导了一种课程学习策略,该策略首先以干净、高置信度的样本进行训练,并逐步加入更困难的案例以增强模型的稳健性。我们进一步采用SigLIP编码器结合Unified 3D-Resampler高效编码高帧率唇动。在我们专门构建的HI-Dialogue数据集上的实验表明,HI-TransPA在字面准确率和语义保真度方面都达到了最新技术水平。这项工作为将Omni-Models应用于辅助通信技术奠定了基础,为未来研究提供了端到端的建模框架和必要的处理工具。
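针对上文提到的“由质量分数引导的课程学习”,下面给出一个纯Python的最小草图:先用质量最高的样本训练,再逐阶段扩大训练子集。阶段数与划分比例均为假设,并非论文的原始实现。

```python
# 最小示意:按质量分数从高到低安排课程学习阶段(阈值与阶段划分为假设)
def curriculum_schedule(samples, quality_scores, n_stages=3):
    """samples: 样本列表;quality_scores: 与之等长的质量分数。
    返回每个阶段可用的训练子集(逐阶段扩大)。"""
    ranked = sorted(zip(quality_scores, samples), key=lambda p: p[0], reverse=True)
    ordered = [s for _, s in ranked]
    stages = []
    for k in range(1, n_stages + 1):
        cutoff = int(len(ordered) * k / n_stages)   # 第 k 阶段纳入质量最高的前 k/n 样本
        stages.append(ordered[:cutoff])
    return stages

stages = curriculum_schedule(["s1", "s2", "s3", "s4"], [0.9, 0.4, 0.7, 0.2], n_stages=2)
print([len(s) for s in stages])  # [2, 4]
```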
论文及项目相关链接
Summary
一种融合了视听信息的多模态交流助手HI-TransPA问世,专门为听障人群提供日常交流的解决方案。它通过捕捉不易辨识的语音信号与高帧率的嘴唇动态,在单一多模态框架内同时支持翻译与对话。本文详细描述了为解决该应用过程中的挑战(如原始数据的噪声和异质性、Omni模型对听障语音的适应性限制等)所构建的预处理和筛选管道,包括面部关键点检测、唇部区域隔离和稳定等关键步骤。采用基于SigLIP编码器和统一三维重采样器(Unified 3D-Resampler)的策略有效提高了高帧率唇部运动的编码效率。在定制化的HI-Dialogue数据集上的实验证明HI-TransPA在字面准确度和语义保真度上达到了领先水平。此研究为Omni模型在辅助通讯技术中的应用奠定了坚实的基础。
Key Takeaways
- HI-TransPA结合了视听信息,为听障人群提供日常交流解决方案。
- HI-TransPA融合模糊语音信号与高帧率唇部动态,在单一多模态框架内支持翻译与对话。
点此查看论文截图
Fairness-Aware Few-Shot Learning for Audio-Visual Stress Detection
Authors:Anushka Sanjay Shelke, Aditya Sneh, Arya Adyasha, Haroon R. Lone
Fairness in AI-driven stress detection is critical for equitable mental healthcare, yet existing models frequently exhibit gender bias, particularly in data-scarce scenarios. To address this, we propose FairM2S, a fairness-aware meta-learning framework for stress detection leveraging audio-visual data. FairM2S integrates Equalized Odds constraints during both meta-training and adaptation phases, employing adversarial gradient masking and fairness-constrained meta-updates to effectively mitigate bias. Evaluated against five state-of-the-art baselines, FairM2S achieves 78.1% accuracy while reducing the Equal Opportunity to 0.06, demonstrating substantial fairness gains. We also release SAVSD, a smartphone-captured dataset with gender annotations, designed to support fairness research in low-resource, real-world contexts. Together, these contributions position FairM2S as a state-of-the-art approach for equitable and scalable few-shot stress detection in mental health AI. We release our dataset and FairM2S publicly with this paper.
人工智能驱动的应激检测中的公平性对于公平的心理健康护理至关重要,然而现有的模型经常表现出性别偏见,特别是在数据稀缺的情况下。为了解决这个问题,我们提出了FairM2S,这是一个利用视听数据的应激检测公平性感知元学习框架。FairM2S在元训练和适应阶段都整合了均衡几率(Equalized Odds)约束,采用对抗性梯度掩蔽和公平性约束元更新来有效缓解偏见。与五个最先进的基线相比,FairM2S达到了78.1%的准确率,同时将机会均等(Equal Opportunity)指标降至0.06,显示出显著的公平性提升。我们还发布了SAVSD,这是一个由智能手机采集、带有性别注释的数据集,旨在支持低资源、现实世界场景中的公平性研究。这些贡献共同使FairM2S成为心理健康人工智能领域中公平且可扩展的少样本应激检测的最先进方法。我们随论文公开发布我们的数据集和FairM2S。
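作为参考,下面给出一个计算分组机会均等(Equal Opportunity)差距的最小示意——即两组在正类样本上的真阳性率之差,与摘要中报告的0.06属于同类指标;函数接口与示例数据均为假设。

```python
# 最小示意:计算按性别分组的 Equal Opportunity 差距(正类样本上两组真阳性率之差)
import numpy as np

def equal_opportunity_gap(y_true, y_pred, group):
    """y_true/y_pred: 0/1 数组;group: 0/1 表示两个性别组。"""
    tprs = []
    for g in (0, 1):
        mask = (group == g) & (y_true == 1)          # 该组的正类样本
        tprs.append(y_pred[mask].mean() if mask.any() else 0.0)
    return abs(tprs[0] - tprs[1])

y_true = np.array([1, 1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])
group  = np.array([0, 0, 0, 1, 1, 1])
print(equal_opportunity_gap(y_true, y_pred, group))  # 0.5
```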
论文及项目相关链接
Summary
人工智能驱动的公平压力检测对公平精神卫生保健至关重要,但现有模型在数据稀缺的场景中经常出现性别偏见。为解决这一问题,我们提出了FairM2S,这是一个利用视听数据的压力检测公平性感知元学习框架。FairM2S在元训练和适应阶段都整合了均衡几率(Equalized Odds)约束,采用对抗性梯度掩蔽和公平性约束元更新来有效缓解偏见。与五种最新技术基线相比,FairM2S实现了78.1%的准确率,同时将机会均等(Equal Opportunity)差距降至0.06,显示出实质性的公平性进展。我们还发布了SAVSD数据集,该数据集采用智能手机采集并带有性别注释,旨在支持低资源、现实环境中的公平性研究。总体而言,FairM2S是心理健康人工智能领域中公平且可扩展的少样本压力检测的最先进方法。我们公开发布了数据集和FairM2S。
Key Takeaways
- 人工智能在压力检测中的公平性对精神卫生保健的公平性至关重要。
- 现有模型在数据稀缺的场景中存在性别偏见问题。
- FairM2S是一个利用视听数据的压力检测公平性感知元学习框架,旨在解决上述问题。
- FairM2S通过整合均衡机会约束在元训练和适应阶段实现公平性。
- FairM2S取得了较高的准确率,并显著降低了公平机会指数。
- 发布了SAVSD数据集,用于支持低资源、现实环境中的公平性研究。
点此查看论文截图
TiDAR: Think in Diffusion, Talk in Autoregression
Authors:Jingyu Liu, Xin Dong, Zhifan Ye, Rishabh Mehta, Yonggan Fu, Vartika Singh, Jan Kautz, Ce Zhang, Pavlo Molchanov
Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality due to their causal structure aligning naturally with language modeling. This raises a fundamental question: can we achieve a synergy with high throughput, higher GPU utilization, and AR level quality? Existing methods fail to effectively balance these two aspects, either prioritizing AR using a weaker model for sequential drafting (speculative decoding), leading to lower drafting efficiency, or using some form of left-to-right (AR-like) decoding logic for diffusion, which still suffers from quality degradation and forfeits its potential parallelizability. We introduce TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively - all within a single forward pass using specially designed structured attention masks. This design exploits the free GPU compute density, achieving a strong balance between drafting and verification capacity. Moreover, TiDAR is designed to be serving-friendly (low overhead) as a standalone model. We extensively evaluate TiDAR against AR models, speculative decoding, and diffusion variants across generative and likelihood tasks at 1.5B and 8B scales. Thanks to the parallel drafting and sampling as well as exact KV cache support, TiDAR outperforms speculative decoding in measured throughput and surpasses diffusion models like Dream and Llada in both efficiency and quality. Most notably, TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71x to 5.91x more tokens per second.
扩散语言模型具有快速并行生成的潜力,而自回归(AR)模型通常因其因果结构与语言建模天然契合而在质量上表现出色。这引发了一个根本性的问题:我们能否同时实现高吞吐量、更高的GPU利用率和AR级质量?现有方法无法有效地平衡这两方面:要么优先保证AR质量、使用较弱的模型进行顺序起草(推测解码),导致起草效率降低;要么对扩散采用某种自左至右(类AR)的解码逻辑,这仍然会导致质量下降并丧失其潜在的并行性。我们引入了TiDAR,这是一个序列级混合架构,在扩散模式下起草词元(思考),并以自回归方式采样最终输出(说话)——所有这些都借助专门设计的结构化注意力掩码在一次前向传递中完成。这种设计利用了闲置的GPU计算密度,在起草和验证能力之间实现了良好的平衡。此外,TiDAR作为独立模型即可低开销地部署服务。我们在1.5B和8B规模的生成与似然(likelihood)任务上,将TiDAR与AR模型、推测解码以及各类扩散模型进行了广泛比较。得益于并行起草与采样以及对精确KV缓存的支持,TiDAR在实测吞吐量上优于推测解码,并且在效率和质量上都超过了Dream和Llada等扩散模型。尤其值得一提的是,TiDAR是首个在质量上缩小与AR模型差距的架构,同时每秒生成的词元数量提升至4.71倍到5.91倍。
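下面是一个假设性的示意,说明“结构化注意力掩码”可以如何把因果可见的前缀与块内双向可见的草稿段放进同一次前向传递;真实的TiDAR掩码设计请以论文为准。

```python
# 假设性示意:前缀 token 使用因果(下三角)可见性,末尾草稿块块内双向可见并可看到全部前缀
import numpy as np

def hybrid_mask(prefix_len, draft_len):
    n = prefix_len + draft_len
    mask = np.zeros((n, n), dtype=bool)            # True 表示允许注意
    mask[:prefix_len, :prefix_len] = np.tril(np.ones((prefix_len, prefix_len), dtype=bool))
    mask[prefix_len:, :prefix_len] = True          # 草稿块可看到整个前缀
    mask[prefix_len:, prefix_len:] = True          # 草稿块内部双向可见
    return mask

print(hybrid_mask(4, 3).astype(int))
```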
论文及项目相关链接
PDF NVIDIA-Tech Report
Summary
本文探讨了扩散语言模型与自回归模型的融合问题,旨在实现快速并行生成与高质输出。为此,提出一种名为TiDAR的序列级混合架构,通过一次前向传递实现扩散中的令牌草稿和最终输出的自回归采样。该设计利用闲置的GPU计算密度,在草稿和验证容量之间实现平衡。经过评估,TiDAR在生成和似然任务方面表现优异,以并行草稿和采样以及精确KV缓存支持为特点,在吞吐量和效率方面优于其他模型。特别是TiDAR首次缩小了与自回归模型的品质差距,同时提高了每秒令牌生成速度。
Key Takeaways
- 扩散语言模型与自回归模型的融合旨在实现快速并行生成与高质输出。
- TiDAR是一种序列级混合架构,结合了扩散模型和自回归采样的优点。
- TiDAR通过一次前向传递实现令牌草稿和最终输出的自回归采样。
- TiDAR利用GPU计算资源,在草稿和验证容量之间取得平衡。
- TiDAR在生成和似然任务方面表现优于其他模型,特别是在吞吐量和效率方面。
- TiDAR缩小了与自回归模型的品质差距。
点此查看论文截图
Is It Truly Necessary to Process and Fit Minutes-Long Reference Videos for Personalized Talking Face Generation?
Authors:Rui-Qing Sun, Ang Li, Zhijing Wu, Tian Lan, Qianyu Lu, Xingshan Yao, Chen Xu, Xian-Ling Mao
Talking Face Generation (TFG) aims to produce realistic and dynamic talking portraits, with broad applications in fields such as digital education, film and television production, e-commerce live streaming, and other related areas. Currently, TFG methods based on Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) have received widespread attention. They learn and store personalized features from reference videos of each target individual to generate realistic speaking videos. To ensure models can capture sufficient 3D information and successfully learn the lip-audio mapping, previous studies usually require meticulous processing and fitting of several minutes of reference video, which often takes hours. The computational burden of processing and fitting long reference videos severely limits the practical application value of these methods. However, is it really necessary to fit minutes of reference video? Our exploratory case studies show that using informative reference video segments of just a few seconds can achieve performance comparable to or even better than the full reference video. This indicates that video informative quality is much more important than its length. Inspired by this observation, we propose ISExplore (short for Informative Segment Explore), a simple-yet-effective segment selection strategy that automatically identifies the informative 5-second reference video segment based on three key data quality dimensions: audio feature diversity, lip movement amplitude, and number of camera views. Extensive experiments demonstrate that our approach increases data processing and training speed by more than 5x for NeRF and 3DGS methods, while maintaining high-fidelity output. Project resources are available at xx.
说话人脸生成(TFG)旨在生成逼真且动态的说话肖像,在数字教育、影视制作、电商直播等领域有广泛应用。目前,基于神经辐射场(NeRF)或三维高斯溅射(3DGS)的TFG方法受到广泛关注。它们从目标个体的参考视频中学习并存储个性化特征,以生成逼真的说话视频。为确保模型能够捕获足够的三维信息并成功学习唇音映射,之前的研究通常需要精细处理和拟合几分钟的参考视频,这一过程往往需要数小时。处理和拟合长参考视频的计算负担严重限制了这些方法在实际应用中的价值。但是,真的有必要拟合这么长的参考视频吗?我们的探索性案例研究表明,使用几秒的参考视频片段就能达到与完整参考视频相当甚至更好的性能,这表明视频的信息质量远比其长度重要。受此观察启发,我们提出了ISExplore(信息片段探索的简称),这是一种简单有效的片段选择策略,它基于三个关键数据质量维度——音频特征多样性、嘴唇运动幅度和摄像机视角数量——自动确定信息丰富的5秒参考视频片段。大量实验表明,我们的方法在保持高保真输出的同时,将NeRF和3DGS方法的数据处理和训练速度提高了5倍以上。项目资源可在xx处获取。
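下面给出ISExplore式片段打分的一个假设性草图:对每个5秒片段按音频特征多样性、唇动幅度和相机视角数三个维度归一化后取平均,选得分最高者;特征计算方式与等权平均均为假设。

```python
# 假设性示意:按三个数据质量维度为候选片段打分并选择最高者
import numpy as np

def normalize(x):
    x = np.asarray(x, dtype=float)
    rng = x.max() - x.min()
    return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

def pick_informative_segment(audio_div, lip_amp, n_views):
    # 三个维度等权平均(权重为假设),返回得分最高的片段索引
    score = (normalize(audio_div) + normalize(lip_amp) + normalize(n_views)) / 3.0
    return int(np.argmax(score)), score

idx, score = pick_informative_segment([0.2, 0.8, 0.5], [1.0, 2.5, 2.0], [1, 3, 2])
print(idx)  # 1:得分最高的 5 秒片段索引
```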
论文及项目相关链接
Summary
基于神经辐射场(NeRF)或3D高斯溅射(3DGS)的说话面孔生成(TFG)技术正受到广泛关注,用于创建逼真的动态肖像,广泛应用于数字教育、影视制作、电商直播等领域。传统方法需要处理并适配几分钟的参考视频以确保捕捉到足够的3D信息和唇音映射,这一过程十分繁琐并限制了其实际应用价值。本研究发现,仅使用几秒的参考视频片段便能实现与完整视频相近或更佳的性能,突显视频信息质量较视频长度更为重要。受此启发,研究提出ISExplore策略,可自动辨识最具信息量的5秒参考视频片段,基于音频特征多样性、唇部动作幅度和摄像头视角数量三个关键数据质量维度。实验显示,该方法可提升NeRF和3DGS方法的数据处理速度和训练速度至5倍以上,同时保持高保真输出。
Key Takeaways
- 说话面孔生成技术受到广泛关注,应用于多个领域。
- 传统方法需处理长时间参考视频,具有计算负担。
- 研究发现使用几秒的信息量大的视频片段可达到甚至超越使用完整视频的性能。
- 视频信息质量比长度更重要。
- ISExplore策略可自动选择最具信息量的5秒视频片段。
- ISExplore策略提升了数据处理和训练速度。
点此查看论文截图
SpikCommander: A High-performance Spiking Transformer with Multi-view Learning for Efficient Speech Command Recognition
Authors:Jiaqi Wang, Liutao Yu, Xiongri Shen, Sihang Guo, Chenlin Zhou, Leilei Zhao, Yi Zhong, Zhiguo Zhang, Zhengyu Ma
Spiking neural networks (SNNs) offer a promising path toward energy-efficient speech command recognition (SCR) by leveraging their event-driven processing paradigm. However, existing SNN-based SCR methods often struggle to capture rich temporal dependencies and contextual information from speech due to limited temporal modeling and binary spike-based representations. To address these challenges, we first introduce the multi-view spiking temporal-aware self-attention (MSTASA) module, which combines effective spiking temporal-aware attention with a multi-view learning framework to model complementary temporal dependencies in speech commands. Building on MSTASA, we further propose SpikCommander, a fully spike-driven transformer architecture that integrates MSTASA with a spiking contextual refinement channel MLP (SCR-MLP) to jointly enhance temporal context modeling and channel-wise feature integration. We evaluate our method on three benchmark datasets: the Spiking Heidelberg Dataset (SHD), the Spiking Speech Commands (SSC), and the Google Speech Commands V2 (GSC). Extensive experiments demonstrate that SpikCommander consistently outperforms state-of-the-art (SOTA) SNN approaches with fewer parameters under comparable time steps, highlighting its effectiveness and efficiency for robust speech command recognition.
脉冲神经网络(Spiking Neural Networks,简称SNNs)通过利用其事件驱动的处理模式,为实现能源高效的语音指令识别(Speech Command Recognition,简称SCR)提供了有前景的路径。然而,现有的基于SNN的SCR方法在捕捉语音中的丰富时间依赖性和上下文信息方面常常遇到困难,这主要是由于有限的时间建模和基于二进制的脉冲表示。为了应对这些挑战,我们首先引入了多视角脉冲时间感知自注意力(Multi-view Spiking Temporal-aware Self-Attention,简称MSTASA)模块,该模块结合了有效的脉冲时间感知注意力与多视角学习框架,以模拟语音指令中的互补时间依赖性。基于MSTASA,我们进一步提出了SpikCommander,这是一种完全由脉冲驱动的变压器架构,它将MSTASA与脉冲上下文细化通道MLP(SCR-MLP)相结合,以共同增强时间上下文建模和通道级特征集成。我们在三个基准数据集上评估了我们的方法:脉冲海德堡数据集(Spiking Heidelberg Dataset,简称SHD)、脉冲语音指令(Spiking Speech Commands,简称SSC)和谷歌语音指令V2(Google Speech Commands V2,简称GSC)。大量实验表明,SpikCommander在参数较少的情况下始终优于最新的SNN方法,并且在相似的时间步长下表现出其有效性和高效性,为稳健的语音指令识别提供了强有力的支持。
论文及项目相关链接
PDF Accepted by The Fortieth AAAI Conference on Artificial Intelligence (AAAI 2026)
Summary
该文介绍了利用脉冲神经网络(SNNs)实现能效高的语音指令识别(SCR)。针对现有方法难以捕捉语音中的丰富时间依赖性和上下文信息的问题,文章提出了多视角脉冲时序感知自注意力(MSTASA)模块和SpikCommander方法。MSTASA结合了有效的脉冲时序感知注意力和多视角学习框架,以模拟语音指令中的互补时间依赖性。在此基础上,SpikCommander是一个完全由脉冲驱动的转换器架构,它将MSTASA与脉冲上下文细化通道MLP(SCR-MLP)相结合,共同增强了时间上下文建模和通道级特征集成。实验结果表明,SpikCommander在参数较少、时间步长相似的情况下,始终优于最新的SNN方法,显示出其在稳健语音指令识别方面的有效性和高效性。
Key Takeaways
- 脉冲神经网络(SNNs)为实现能效高的语音指令识别(SCR)提供了前景。
- 现有SNN-based SCR方法面临捕捉语音中丰富的时间依赖性和上下文信息的挑战。
- 引入MSTASA模块,结合脉冲时序感知注意力和多视角学习框架,以模拟语音指令中的互补时间依赖性。
- 提出SpikCommander方法,是一个完全由脉冲驱动的转换器架构,结合了MSTASA和脉冲上下文细化通道MLP(SCR-MLP)。
- SpikCommander增强了时间上下文建模和通道级特征集成。
- 实验表明,SpikCommander在参数和时间步长相似的情况下,性能优于最新的SNN方法。
点此查看论文截图
CAVER: Curious Audiovisual Exploring Robot
Authors:Luca Macesanu, Boueny Folefack, Samik Singh, Ruchira Ray, Ben Abbatematteo, Roberto Martín-Martín
Multimodal audiovisual perception can enable new avenues for robotic manipulation, from better material classification to the imitation of demonstrations for which only audio signals are available (e.g., playing a tune by ear). However, to unlock such multimodal potential, robots need to learn the correlations between an object’s visual appearance and the sound it generates when they interact with it. Such an active sensorimotor experience requires new interaction capabilities, representations, and exploration methods to guide the robot in efficiently building increasingly rich audiovisual knowledge. In this work, we present CAVER, a novel robot that builds and utilizes rich audiovisual representations of objects. CAVER includes three novel contributions: 1) a novel 3D printed end-effector, attachable to parallel grippers, that excites objects’ audio responses, 2) an audiovisual representation that combines local and global appearance information with sound features, and 3) an exploration algorithm that uses and builds the audiovisual representation in a curiosity-driven manner that prioritizes interacting with high uncertainty objects to obtain good coverage of surprising audio with fewer interactions. We demonstrate that CAVER builds rich representations in different scenarios more efficiently than several exploration baselines, and that the learned audiovisual representation leads to significant improvements in material classification and the imitation of audio-only human demonstrations. https://caver-bot.github.io/
多模态视听感知能为机器人操作开辟新的途径,从更好的材料分类到模仿只有音频信号的演示(例如,凭耳力演奏曲调)。然而,为了解锁这种多模态潜力,机器人需要学习物体外观与其交互时产生的声音之间的关联。这种主动的感觉运动经验需要新的交互能力、表示方法和探索方法来指导机器人有效地构建日益丰富的视听知识。在这项工作中,我们介绍了CAVER,这是一种构建和利用物体丰富视听表示的新型机器人。CAVER包括三个新颖的贡献:1)一种新型3D打印末端执行器,可附加到平行夹爪上,用于激发物体的音频响应;2)一种将局部和全局外观信息与声音特征相结合的视听表示;3)一种探索算法,它以好奇心驱动的方式使用并构建视听表示,优先与不确定性高的物体交互,从而以更少的交互次数充分覆盖出人意料的音频反馈。我们证明,CAVER在不同场景中比多种探索基线更高效地构建丰富的表示,并且所学的视听表示在材料分类和仅凭音频的人类演示模仿方面带来了显著的改进。项目主页:https://caver-bot.github.io/。
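下面用一个极简草图示意“好奇心驱动、优先选择高不确定性物体”的探索循环;不确定性的初始化与更新规则均为假设,仅用于说明思路。

```python
# 假设性示意:维护每个物体的不确定性估计,优先与不确定性最高的物体交互
def choose_object(uncertainty):
    return max(uncertainty, key=uncertainty.get)

def update(uncertainty, obj, surprise, lr=0.5):
    # 惊讶越小说明模型已较熟悉该物体,不确定性随之下降(更新规则为简化假设)
    uncertainty[obj] = (1 - lr) * uncertainty[obj] + lr * surprise
    return uncertainty

u = {"cup": 1.0, "box": 0.3, "key": 0.7}
target = choose_object(u)          # 'cup':当前不确定性最高
u = update(u, target, surprise=0.2)
print(target, round(u["cup"], 2))  # cup 0.6
```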
论文及项目相关链接
PDF 9 pages, 6 figures
Summary
本文主要介绍了一种名为CAVER的新型机器人技术,它能够构建和利用丰富的视听对象表示。通过三项新贡献,包括可附着于平行夹具的3D打印末端执行器、结合局部和全局外观信息与声音特征的视听表示,以及好奇心驱动的探索算法,CAVER机器人能够在不同场景下更有效地构建丰富的视听知识表示,显著提高材料分类和仅听音频的人类演示模仿的准确性。
Key Takeaways
- 多模态视听感知能力为机器人操作带来了新的可能性,如更好的材料分类和仅通过音频信号进行示范模仿(例如,通过听觉模仿弹奏乐曲)。
- 为了实现这种多模态潜力,机器人需要学习对象视觉外观与其交互时产生的声音之间的关联。
- CAVER机器人是一种新型机器人技术,能够构建和利用丰富的视听对象表示。
- CAVER包括三项新贡献:3D打印末端执行器、视听表示法和探索算法。
- 3D打印末端执行器可激发对象的音频响应。
- 视听表示法结合了局部和全局外观信息与声音特征。
- 探索算法采用好奇心驱动的方式,优先与不确定性较高的对象进行交互,以用较少的交互获得出人意料的音频覆盖。
点此查看论文截图
E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis
Authors:Zhisheng Zhang, Derui Wang, Yifan Mi, Zhiyong Wu, Jie Gao, Yuxin Cao, Kai Ye, Minhui Xue, Jie Hao
Recent advancements in speech synthesis technology have enriched our daily lives, with high-quality and human-like audio widely adopted across real-world applications. However, malicious exploitation like voice-cloning fraud poses severe security risks. Existing defense techniques struggle to address the production large language model (LLM)-based speech synthesis. While previous studies have considered the protection for fine-tuning synthesizers, they assume manually annotated transcripts. Given the labor intensity of manual annotation, end-to-end (E2E) systems leveraging automatic speech recognition (ASR) to generate transcripts are becoming increasingly prevalent, e.g., voice cloning via commercial APIs. Therefore, this E2E speech synthesis also requires new security mechanisms. To tackle these challenges, we propose E2E-VGuard, a proactive defense framework for two emerging threats: (1) production LLM-based speech synthesis, and (2) the novel attack arising from ASR-driven E2E scenarios. Specifically, we employ the encoder ensemble with a feature extractor to protect timbre, while ASR-targeted adversarial examples disrupt pronunciation. Moreover, we incorporate the psychoacoustic model to ensure perturbative imperceptibility. For a comprehensive evaluation, we test 16 open-source synthesizers and 3 commercial APIs across Chinese and English datasets, confirming E2E-VGuard’s effectiveness in timbre and pronunciation protection. Real-world deployment validation is also conducted. Our code and demo page are available at https://wxzyd123.github.io/e2e-vguard/.
最近语音合成技术的进展丰富了我们的日常生活,高质量、人性化的音频在真实世界应用中得到了广泛采用。然而,诸如语音克隆欺诈等恶意利用行为带来了严重的安全风险。现有的防御技术很难应对基于大型语言模型(LLM)的语音合成。虽然以前的研究已经考虑了对微调合成器的保护,但它们假设使用手动注释的文本。考虑到手动注释的劳动密集度,利用自动语音识别(ASR)生成文本端到端(E2E)系统正变得越来越普遍,例如通过商业API进行语音克隆。因此,这种端到端的语音合成也需要新的安全机制。为了应对这些挑战,我们提出了端到端VGuard(E2E-VGuard),这是一个针对两个新兴威胁的积极防御框架:(1)基于大型语言模型的语音合成生产,(2)由ASR驱动的端到端场景产生的新型攻击。具体来说,我们使用编码器集与特征提取器来保护音色,同时针对ASR设计的对抗性示例会破坏发音。此外,我们结合了心理声学模型来确保扰动的不易察觉性。为了进行全面评估,我们测试了16个开源合成器和跨中文和英文数据集的3个商业API,证实了E2E-VGuard在音色和发音保护方面的有效性。还进行了现实世界部署验证。我们的代码和演示页面可在https://wxzyd123.github.io/e2e-vguard/访问。
论文及项目相关链接
PDF Accepted to NeurIPS 2025
Summary
本文介绍了近期语音合成技术的进展及其在日常生活中的广泛应用,同时指出恶意利用如语音克隆欺诈等严重安全风险。现有防御技术难以应对基于大型语言模型(LLM)的语音合成,而现有研究主要集中在精细调整合成器的保护方面,并假设手动注释的文本。考虑到手动注释的劳动密集性,采用自动语音识别(ASR)生成文本的端到端(E2E)系统越来越普遍。因此,这种E2E语音合成也需要新的安全机制。针对这些挑战,提出了E2E-VGuard,一个针对两种新兴威胁的主动防御框架:一是基于生产LLM的语音合成,二是来自ASR驱动的E2E场景的全新攻击。通过编码器组合、特征提取器保护音色、针对ASR的对抗性实例干扰发音,并结合心理声学模型确保扰动不可察觉。实验测试了多种开源合成器和商业API,验证了E2E-VGuard在音色和发音保护方面的有效性。
Key Takeaways
- 近期语音合成技术广泛应用于日常生活,但存在恶意利用如语音克隆欺诈等安全风险。
- 现有防御技术难以应对基于大型语言模型(LLM)的语音合成和端到端(E2E)系统的新兴威胁。
- E2E-VGuard是一个针对LLM-based语音合成和E2E场景的新型攻击的主动防御框架。
- E2E-VGuard通过编码器组合、特征提取器保护音色,通过ASR对抗性实例干扰发音。
- 结合心理声学模型确保扰动不可察觉,增强防御策略的有效性。
- 实验测试了多种开源合成器和商业API,验证了E2E-VGuard在音色和发音保护方面的有效性。
点此查看论文截图
Reperio-rPPG: Relational Temporal Graph Neural Networks for Periodicity Learning in Remote Physiological Measurement
Authors:Ba-Thinh Nguyen, Thach-Ha Ngoc Pham, Hoang-Long Duc Nguyen, Thi-Duyen Ngo, Thanh-Ha Le
Remote photoplethysmography (rPPG) is an emerging contactless physiological sensing technique that leverages subtle color variations in facial videos to estimate vital signs such as heart rate and respiratory rate. This non-invasive method has gained traction across diverse domains, including telemedicine, affective computing, driver fatigue detection, and health monitoring, owing to its scalability and convenience. Despite significant progress in remote physiological signal measurement, a crucial characteristic - the intrinsic periodicity - has often been underexplored or insufficiently modeled in previous approaches, limiting their ability to capture fine-grained temporal dynamics under real-world conditions. To bridge this gap, we propose Reperio-rPPG, a novel framework that strategically integrates Relational Convolutional Networks with a Graph Transformer to effectively capture the periodic structure inherent in physiological signals. Additionally, recognizing the limited diversity of existing rPPG datasets, we further introduce a tailored CutMix augmentation to enhance the model’s generalizability. Extensive experiments conducted on three widely used benchmark datasets - PURE, UBFC-rPPG, and MMPD - demonstrate that Reperio-rPPG not only achieves state-of-the-art performance but also exhibits remarkable robustness under various motion (e.g., stationary, rotation, talking, walking) and illumination conditions (e.g., nature, low LED, high LED). The code is publicly available at https://github.com/deconasser/Reperio-rPPG.
远程光容积法(rPPG)是一种新兴的非接触式生理传感技术,它利用面部视频中的细微色彩变化来估计心率和呼吸率等生命体征。这种非侵入式的方法在远程医疗、情感计算、驾驶员疲劳检测以及健康监测等多个领域都受到了广泛的关注,因为它具有可扩展性和便捷性。尽管在远程生理信号测量方面已经取得了显著进展,但以前的方法常常忽略或未能充分建模一个重要特征——固有周期性,这在现实条件下限制了它们捕捉细微时间动态的能力。为了弥补这一不足,我们提出了Reperio-rPPG这一新型框架,它战略性地结合了关系卷积网络和图变换器,以有效地捕捉生理信号中固有的周期性结构。此外,我们意识到现有rPPG数据集多样性有限的问题,因此我们进一步引入了定制的CutMix增强技术来提高模型的泛化能力。在三个广泛使用的基准数据集——PURE、UBFC-rPPG和MMPD上进行的大量实验表明,Reperio-rPPG不仅达到了最先进的性能,而且在各种运动(如静止、旋转、说话、行走)和照明条件(如自然光、低LED、高LED)下也表现出了惊人的稳健性。代码公开可访问于 https://github.com/deconasser/Reperio-rPPG。
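下面给出一个针对一维rPPG波形的“时间维CutMix”假设性草图:将样本B的一段连续片段替换进样本A,标签按片段占比线性混合;论文中定制的CutMix细节请以原文为准。

```python
# 假设性示意:一维生理信号上的时间维 CutMix 数据增强
import numpy as np

def temporal_cutmix(sig_a, sig_b, label_a, label_b, ratio=0.3, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(sig_a)
    seg = int(n * ratio)                      # 被替换片段的长度
    start = rng.integers(0, n - seg + 1)
    mixed = sig_a.copy()
    mixed[start:start + seg] = sig_b[start:start + seg]
    lam = 1 - seg / n                         # 标签按片段占比线性混合
    return mixed, lam * label_a + (1 - lam) * label_b

a, b = np.sin(np.linspace(0, 6.28, 100)), np.cos(np.linspace(0, 6.28, 100))
mixed, hr = temporal_cutmix(a, b, label_a=72.0, label_b=60.0)
print(round(hr, 1))  # 68.4
```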
论文及项目相关链接
Summary
远程光容积脉搏波法(rPPG)是一种新兴的无接触生理传感技术,它通过面部视频中的细微色彩变化来估计心率和呼吸率等生命体征。由于其可扩展性和便利性,它在远程生理信号测量领域取得了进展,并广泛应用于远程医疗、情感计算、驾驶员疲劳检测和健康监测等领域。针对以往研究中忽略或建模不足的固有周期性特征,我们提出了Reperio-rPPG框架,该框架结合了关系卷积网络和图变换器,以有效捕捉生理信号中的周期性结构。同时,我们引入定制的CutMix增强方法以提高模型的泛化能力。在多个广泛使用的基准数据集上的实验表明,Reperio-rPPG不仅达到了最先进的性能水平,还在各种运动和光照条件下表现出了卓越的鲁棒性。
Key Takeaways
- 远程光容积脉搏波法(rPPG)是一种利用面部视频中的色彩变化估计生命体征的非接触生理传感技术。
- rPPG技术在远程医疗、情感计算、驾驶员疲劳检测和健康监测等领域具有广泛应用。
- 以往rPPG研究常忽略或建模不足的固有周期性特征限制了其在真实世界条件下的精细时间动态捕捉能力。
- Reperio-rPPG框架结合了关系卷积网络和图变换器,以有效捕捉生理信号中的周期性结构。
- Reperio-rPPG通过引入定制的CutMix增强方法提高了模型的泛化能力。
- 在多个基准数据集上的实验表明,Reperio-rPPG达到了最先进的性能水平。
点此查看论文截图
Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale
Authors:David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu, Prithviraj Ammanabrolu, Hyunwoo Kim, Yuan-Hong Liao, Yejin Choi
Recent progress in multimodal reasoning has been driven largely by undisclosed datasets and proprietary data synthesis recipes, leaving open questions about how to systematically build large-scale, vision-centric reasoning datasets, particularly for tasks that go beyond visual math. In this work, we introduce a new reasoning data generation framework spanning diverse skills and levels of complexity with over 1M high-quality synthetic vision-centric questions. The dataset also includes preference data and instruction prompts supporting both offline and online RL. Our synthesis framework proceeds in two stages: (1) scale; and (2) complexity. Reasoning traces are then synthesized through a two-stage process that leverages VLMs and reasoning LLMs, producing CoT traces for VLMs that capture the richness and diverse cognitive behaviors found in frontier reasoning models. Remarkably, we show that finetuning Qwen2.5-VL-7B on our data outperforms all open-data baselines across all evaluated vision-centric benchmarks, and even surpasses strong closed-data models such as MiMo-VL-7B-RL on V* Bench, CV-Bench and MMStar-V. Perhaps most surprising, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro) and audio reasoning (MMAU), demonstrating its effectiveness. Similarly, despite not containing videos or embodied visual data, we observe notable gains when evaluating on a single-evidence embodied QA benchmark (NiEH). Finally, we use our data to analyze the entire VLM post-training pipeline. Our empirical analysis highlights that (i) SFT on high-quality data with non-linear reasoning traces is essential for effective online RL, (ii) staged offline RL matches online RL’s performance while reducing compute demands, and (iii) careful SFT on high quality data can substantially improve out-of-domain, cross-modality transfer.
最近的多模态推理进展在很大程度上得益于未公开的数据集和专有的数据合成方法,如何系统地构建大规模、以视觉为中心的推理数据集仍是悬而未决的问题,特别是对于超越视觉数学的任务。在这项工作中,我们引入了一个新的推理数据生成框架,该框架涵盖多种技能和复杂程度,包含超过100万个高质量的合成视觉中心问题。该数据集还包括支持离线和在线强化学习的偏好数据和指令提示。我们的合成框架分为两个阶段:(1)规模;(2)复杂性。随后通过一个利用视觉语言模型(VLM)和推理大型语言模型(LLM)的两阶段过程合成推理轨迹,为VLM生成能够体现前沿推理模型中丰富而多样认知行为的思维链(CoT)轨迹。值得注意的是,在我们的数据上微调Qwen2.5-VL-7B后,其在所有评测的视觉中心基准上均超过了所有使用公开数据的基线,甚至在V* Bench、CV-Bench和MMStar-V上超越了MiMo-VL-7B-RL等强大的闭源数据模型。最令人惊讶的是,尽管数据完全以视觉为中心,它仍能正向迁移到纯文本推理(MMLU-Pro)和音频推理(MMAU)任务。同样,尽管不包含视频或具身视觉数据,在单证据具身问答基准(NiEH)上评估时我们也观察到了显著的收益。最后,我们使用这些数据分析了整个VLM后训练流程。我们的实证分析表明:(i)在具有非线性推理轨迹的高质量数据上进行监督微调(SFT)对于有效的在线强化学习至关重要;(ii)分阶段的离线强化学习在降低计算需求的同时可以达到与在线强化学习相当的性能;(iii)在高质量数据上进行细致的SFT可以显著提高跨领域、跨模态的迁移能力。
论文及项目相关链接
PDF Project Page: https://nvlabs.github.io/LongGroundedThoughts/
Summary
本文介绍了一个全新的推理数据生成框架,该框架涵盖各项技能和不同复杂度级别,包含超过100万高质量合成视觉中心问题。数据集包含支持离线在线强化学习的偏好数据和指令提示。合成框架分为两个阶段:规模和复杂度。通过利用视觉语言模型(VLMs)和推理大型语言模型(LLMs),进行两阶段推理轨迹合成,捕捉前沿推理模型的丰富性和多样化认知行为。研究表明,在本文数据集上微调模型表现优越,超过所有公开数据基准测试,并在某些基准测试中超越封闭数据模型。本文数据在视觉中心任务外,如文本和音频推理任务中也表现出积极迁移效果。最后,使用本文数据分析整个VLM后训练管道,强调高质量数据的重要性以及分阶段离线RL在减少计算需求的同时匹配在线RL性能的优势。
Key Takeaways
- 引入了一个涵盖多种技能和复杂度级别的大规模、高质量合成视觉为中心的数据集,包括超过一百万的问题。
- 合成框架分为规模和复杂度两个阶段进行数据生成。
- 利用视觉语言模型(VLMs)和推理大型语言模型(LLMs)来生成包含丰富认知行为的推理轨迹。
- 模型在该数据集上的表现超过所有公开基准测试,并且在某些基准测试中超越了使用封闭数据的模型。
- 数据展现出积极迁移效应,适用于视觉中心任务以外的其他领域,如文本和音频推理任务。
- 实证分析显示,高质量数据对于有效强化学习至关重要。分阶段离线强化学习能在减少计算需求的同时与在线强化学习性能相匹配。
点此查看论文截图
psiUnity: A Platform for Multimodal Data-Driven XR
Authors:Akhil Ajikumar, Sahil Mayenkar, Steven Yoo, Sakib Reza, Mohsen Moghaddam
Extended reality (XR) research increasingly relies on the ability to stream and synchronize multimodal data between headsets and immersive applications for data-driven interaction and experimentation. However, developers face a critical gap: the Platform for Situated Intelligence (psi), which excels at deterministic temporal alignment and multimodal data management, has been largely inaccessible to the dominant Unity/MRTK ecosystem used for HoloLens development. We introduce psiUnity, an open-source C# integration that bridges psi’s .NET libraries with Unity 2022.3 and MRTK3 for HoloLens 2. psiUnity enables bidirectional, real-time streaming of head pose, hand tracking, gaze, IMU, audio, and depth sensor data (AHAT and long-throw) with microsecond-level temporal precision, allowing Unity applications to both consume and produce synchronized multimodal data streams. By embedding psi’s native serialization, logging, and temporal coordination directly within Unity’s architecture, psiUnity extends psi beyond its previous StereoKit limitations and empowers the HRI, HCI, and embodied-AI communities to develop reproducible, data-driven XR interactions and experiments within the familiar Unity environment. The integration is available at https://github.com/sailgt/psiUnity.
扩展现实(XR)研究越来越依赖于在头戴设备和沉浸式应用程序之间流式传输和同步多模态数据的能力,以实现数据驱动的交互和实验。然而,开发者面临一个关键的鸿沟:擅长确定性时间对齐和多模态数据管理的情境智能平台(psi),在很大程度上无法被用于HoloLens开发的主流Unity/MRTK生态系统所使用。我们推出了psiUnity,这是一个开源的C#集成,将psi的.NET库与Unity 2022.3和MRTK3(用于HoloLens 2)连接起来。psiUnity支持以微秒级时间精度双向实时传输头部姿态、手部跟踪、注视、IMU、音频和深度传感器数据(AHAT和long-throw),使Unity应用程序能够同时消费和产生同步的多模态数据流。通过将psi的原生序列化、日志记录和时序协调直接嵌入Unity的架构中,psiUnity超越了其之前仅支持StereoKit的限制,并使HRI、HCI和具身AI社区能够在熟悉的Unity环境中开发可复现的、数据驱动的XR交互和实验。该集成可在https://github.com/sailgt/psiUnity获得。
论文及项目相关链接
Summary
该文本介绍了Extended Reality(XR)研究中面临的挑战,即需要流化和同步头戴设备和沉浸式应用程序之间的多模态数据以实现数据驱动的交互和实验。为此,该文介绍了psiUnity这一开源C#集成解决方案,它成功地将psi的.NET库与Unity 2022.3和MRTK3连接起来,使得为HoloLens 2开发的程序能够进行双向实时流化包括头部姿态、手部追踪、注视、IMU、音频和深度传感器数据等(AHAT和长距离投射),并达到微秒级的时序精度。这为Unity应用程序提供了消费和生产同步多模态数据流的能力。通过将psi的原生序列化、日志记录和时序协调直接嵌入Unity架构中,psiUnity不仅超越了之前的StereoKit限制,还为HRI、HCI和embodied-AI社区提供了在熟悉的Unity环境中开发可重复的、数据驱动的XR交互和实验的能力。
Key Takeaways
- Extended Reality (XR) 研究需要流化和同步头戴设备和沉浸式应用之间的多模态数据。
- 存在一个关键差距:psi平台此前难以被用于HoloLens开发的主流Unity/MRTK生态系统所使用。
- psiUnity是一个开源的C#集成方案,旨在连接psi的.NET库与Unity 2022.3和MRTK3,弥合了这一差距。
- psiUnity能够实现头部姿态、手部追踪等多种数据的双向实时流化。这种流化能够达到微秒级的时序精度。
- psiUnity使得Unity应用程序既能消费也能产生同步的多模态数据流。
- psiUnity通过将原生序列化等功能直接嵌入Unity架构中,扩展了psi的功能并超越了之前的StereoKit限制。
点此查看论文截图
Enhancing Public Speaking Skills in Engineering Students Through AI
Authors:Amol Harsh, Brainerd Prince, Siddharth Siddharth, Deepan Raj Prabakar Muthirayan, Kabir S Bhalla, Esraaj Sarkar Gupta, Siddharth Sahu
This research-to-practice full paper was inspired by the persistent challenge in effective communication among engineering students. Public speaking is a necessary skill for future engineers as they have to communicate technical knowledge with diverse stakeholders. While universities offer courses or workshops, they are unable to offer sustained and personalized training to students. Providing comprehensive feedback on both verbal and non-verbal aspects of public speaking is time-intensive, making consistent and individualized assessment impractical. This study integrates research on verbal and non-verbal cues in public speaking to develop an AI-driven assessment model for engineering students. Our approach combines speech analysis, computer vision, and sentiment detection into a multi-modal AI system that provides assessment and feedback. The model evaluates (1) verbal communication (pitch, loudness, pacing, intonation), (2) non-verbal communication (facial expressions, gestures, posture), and (3) expressive coherence, a novel integration ensuring alignment between speech and body language. Unlike previous systems that assess these aspects separately, our model fuses multiple modalities to deliver personalized, scalable feedback. Preliminary testing demonstrated that our AI-generated feedback was moderately aligned with expert evaluations. Among the state-of-the-art AI models evaluated, all of which were Large Language Models (LLMs), including Gemini and OpenAI models, Gemini Pro emerged as the best-performing, showing the strongest agreement with human annotators. By eliminating reliance on human evaluators, this AI-driven public speaking trainer enables repeated practice, helping students naturally align their speech with body language and emotion, crucial for impactful and professional communication.
这篇研究到实践的完整论文源于工程学生在有效沟通方面长期存在的挑战。公共演讲是未来工程师必备的技能,因为他们需要与各种利益相关者交流技术知识。虽然大学提供课程或研讨会,但它们无法为学生提供持续和个性化的培训。对公共演讲的言语和非言语方面提供全面的反馈非常耗时,使得持续且个性化的评估不切实际。本研究整合了关于公共演讲中言语和非言语线索的研究,为工程学生开发了一个AI驱动的评估模型。我们的方法将语音分析、计算机视觉和情感检测结合为一个多模态AI系统,提供评估和反馈。该模型评估(1)言语交流(音调、音量、语速、语调),(2)非言语交流(面部表情、手势、姿势),以及(3)表达一致性——一种确保言语与肢体语言相互对齐的新型综合指标。与以前分别评估这些方面的系统不同,我们的模型融合多种模态,提供个性化、可扩展的反馈。初步测试表明,我们AI生成的反馈与专家评估中度一致。在所评估的先进AI模型(均为大型语言模型,包括Gemini和OpenAI系列模型)中,Gemini Pro表现最佳,与人类标注者的一致性最高。通过消除对人类评估者的依赖,这一AI驱动的公共演讲训练器支持反复练习,帮助学生自然地使演讲与肢体语言和情感保持一致,这对于有影响力且专业的沟通至关重要。
论文及项目相关链接
Summary
本研究结合工程学生在沟通方面所面临的挑战,提出了一种人工智能驱动的评估模型,该模型能够对学生的演讲技能进行全面评估并给出反馈。模型融合了语音分析、计算机视觉和情感检测等技术,从多个维度对学生的口头和非口头沟通能力进行评估,并提供个性化的反馈。初步测试表明,该模型的评估结果与专家评价基本一致。在多种先进的AI模型中,Gemini Pro表现最佳,与人类评价者的评估结果最为接近。该模型可帮助学生提高沟通技能,为未来的工程师提供专业且有影响力的沟通技巧。
Key Takeaways
- 本研究旨在解决工程学生在沟通方面的挑战,特别是公共演讲技能的培养。
- 模型融合了语音分析、计算机视觉和情感检测等技术,多维度评估学生的口头表达能力。
- 反馈机制注重学生的非语言沟通能力,包括面部表情、姿势和手势等。
- 创新性地提出了“表达一致性”的评估标准,确保口语和肢体语言之间的协调。
- 在初步测试中,AI模型的评估结果与专家评价保持一致。
- 在多个AI模型中,Gemini Pro表现最佳,与人类评价者的评估结果最为接近。
点此查看论文截图
Caption Injection for Optimization in Generative Search Engine
Authors:Xiaolu Chen, Yong Liao
Generative Search Engines (GSEs) leverage Retrieval-Augmented Generation (RAG) techniques and Large Language Models (LLMs) to integrate multi-source information and provide users with accurate and comprehensive responses. Unlike traditional search engines that present results in ranked lists, GSEs shift users’ attention from sequential browsing to content-driven subjective perception, driving a paradigm shift in information retrieval. In this context, enhancing the subjective visibility of content through Generative Search Engine Optimization (G-SEO) methods has emerged as a new research focus. With the rapid advancement of Multimodal Retrieval-Augmented Generation (MRAG) techniques, GSEs can now efficiently integrate text, images, audio, and video, producing richer responses that better satisfy complex information needs. Existing G-SEO methods, however, remain limited to text-based optimization and fail to fully exploit multimodal data. To address this gap, we propose Caption Injection, the first multimodal G-SEO approach, which extracts captions from images and injects them into textual content, integrating visual semantics to enhance the subjective visibility of content in generative search scenarios. We systematically evaluate Caption Injection on MRAMG, a benchmark for MRAG, under both unimodal and multimodal settings. Experimental results show that Caption Injection significantly outperforms text-only G-SEO baselines under the G-Eval metric, demonstrating the necessity and effectiveness of multimodal integration in G-SEO to improve user-perceived content visibility.
生成式搜索引擎(GSE)利用检索增强生成(RAG)技术和大型语言模型(LLM)来整合多源信息,为用户提供准确而全面的回答。与以排名列表呈现结果的传统搜索引擎不同,GSE将用户的注意力从顺序浏览转变为内容驱动的主观感知,推动了信息检索的范式转变。在此背景下,通过生成式搜索引擎优化(G-SEO)方法来提高内容的主观可见性已成为新的研究重点。随着多模态检索增强生成(MRAG)技术的快速发展,GSE现在可以有效地整合文本、图像、音频和视频,产生更丰富、更能满足复杂信息需求的回答。然而,现有的G-SEO方法仍然局限于基于文本的优化,未能充分利用多模态数据。为了弥补这一空白,我们提出了首个多模态G-SEO方法Caption Injection,它从图像中提取字幕并将其注入文本内容中,通过整合视觉语义来提高生成式搜索场景中内容的主观可见性。我们在MRAG基准MRAMG上,于单模态和多模态两种设置下系统地评估了Caption Injection。实验结果表明,在G-Eval指标下,Caption Injection显著优于仅使用文本的G-SEO基线,证明了在G-SEO中进行多模态集成对提高用户感知内容可见性的必要性和有效性。
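下面是Caption Injection思路的一个假设性草图:先为文档中的图像生成字幕,再把字幕以自然语言形式注入正文;其中caption_model为占位函数,注入位置与格式均为假设。

```python
# 假设性示意:将图像字幕注入文档文本,供生成式搜索引擎利用视觉语义
def caption_model(image_path):
    # 占位实现:真实场景中应调用任意图像描述模型
    return f"图片描述({image_path})"

def inject_captions(text, image_paths):
    captions = [caption_model(p) for p in image_paths]
    # 将字幕以自然语言句子的形式附加到正文末尾(注入位置与格式为假设)
    return text + " " + " ".join(captions)

doc = inject_captions("本文介绍一款便携咖啡机。", ["img/machine.jpg"])
print(doc)
```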
论文及项目相关链接
Summary
基于生成式搜索引擎(GSE)和大型语言模型(LLM)的多源信息融合技术,使得用户可以获得精准全面的响应。不同于传统搜索引擎的结果排名展示方式,GSE将用户的注意力从顺序浏览转变为内容驱动的感性认知,实现了信息检索的范式转变。在此背景下,通过生成式搜索引擎优化(G-SEO)方法提升内容的感性可见度已成为新的研究焦点。随着多模态检索增强生成(MRAG)技术的快速发展,GSE能够高效融合文本、图像、音频和视频,产生更丰富、更满足复杂信息需求的响应。然而,现有的G-SEO方法仅限于文本优化,未能充分利用多模态数据。为解决这一空白,我们提出“Caption Injection”作为首个多模态G-SEO方法,通过提取图像字幕并注入文本内容,整合视觉语义信息,提升生成场景中内容的感性可见度。系统评估表明,在MRAG的基准测试MRAMG上,无论是单模态还是多模态环境下,Caption Injection均显著优于仅文本的G-SEO基线。
Key Takeaways
- GSEs整合多源信息,提供精准全面的响应。
- 传统搜索引擎主要依赖结果排名展示方式,而GSEs推动信息检索从顺序浏览向内容感知转变。
- G-SEO方法旨在提升内容的感性可见度。
- MRAG技术使得GSE能够融合多种模态数据(如文本、图像、音频和视频)。
- 现有G-SEO方法主要局限于文本优化。
- Caption Injection作为首个多模态G-SEO方法,通过整合视觉语义信息提升内容可见度。
点此查看论文截图
Laugh, Relate, Engage: Stylized Comment Generation for Short Videos
Authors:Xuan Ouyang, Senan Wang, Bouzhou Wang, Siyuan Xiahou, Jinrong Zhou, Yuekang Li
Short-video platforms have become a central medium in the modern Internet landscape, where efficient information delivery and strong interactivity are reshaping user engagement and cultural dissemination. Among the various forms of user interaction, comments play a vital role in fostering community participation and enabling content re-creation. However, generating comments that are both compliant with platform guidelines and capable of exhibiting stylistic diversity and contextual awareness remains a significant challenge. We introduce LOLGORITHM, a modular multi-agent system (MAS) designed for controllable short-video comment generation. The system integrates video segmentation, contextual and affective analysis, and style-aware prompt construction. It supports six distinct comment styles: puns (homophones), rhyming, meme application, sarcasm (irony), plain humor, and content extraction. Powered by a multimodal large language model (MLLM), LOLGORITHM directly processes video inputs and achieves fine-grained style control through explicit prompt markers and few-shot examples. To support development and evaluation, we construct a bilingual dataset using official APIs from Douyin (Chinese) and YouTube (English), covering five popular video genres: comedy skits, daily life jokes, funny animal clips, humorous commentary, and talk shows. Evaluation combines automated metrics originality, relevance, and style conformity with a large-scale human preference study involving 40 videos and 105 participants. Results show that LOLGORITHM significantly outperforms baseline models, achieving preference rates of over 90% on Douyin and 87.55% on YouTube. This work presents a scalable and culturally adaptive framework for stylized comment generation on short-video platforms, offering a promising path to enhance user engagement and creative interaction.
短视频平台已成为现代互联网景观中一种重要的媒介,高效的信息传递和强大的互动性正在重塑用户参与和文化传播的方式。在多种形式的用户互动中,评论对于促进社区参与和推动内容再创作起到了至关重要的作用。然而,生成既符合平台指南又具备风格多样性和上下文意识的评论仍然是一个巨大的挑战。我们引入了LOLGORITHM,这是一个专为可控短视频评论生成设计的模块化多智能体系统(MAS)。该系统融合了视频分割、上下文和情感分析以及风格感知提示构建。它支持六种不同的评论风格:双关语(谐音)、押韵、模因应用、讽刺(反语)、普通幽默和内容提取。LOLGORITHM由多模态大型语言模型(MLLM)驱动,直接处理视频输入,并通过明确的提示标记和少量示例实现精细的风格控制。为了支持开发和评估,我们使用抖音(中文)和YouTube(英文)的官方API构建了一个双语数据集,涵盖了五种流行的视频类型:喜剧短片、日常生活笑话、有趣的动物剪辑、幽默评论和谈话节目。评估结合了自动化指标(原创性、相关性和风格一致性)以及涉及40个视频和105名参与者的大规模人类偏好研究。结果表明,LOLGORITHM显著优于基准模型,在抖音上的偏好率超过90%,在YouTube上的偏好率为87.55%。这项工作为短视频平台上的风格化评论生成提供了一个可扩展且文化适应的框架,为增强用户参与度和创造性互动提供了有希望的途径。
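下面给出“显式风格标记 + 少样本示例”提示构建的一个假设性草图;示例评论与标记格式均为本示意虚构,并非论文原始提示词。

```python
# 假设性示意:按风格标记与少样本示例构造评论生成提示词
STYLE_EXAMPLES = {
    "谐音梗": ["这视频我先'赞'为敬,毕竟'狗'富贵互相旺(汪)。"],
    "押韵":   ["看完这段小短剧,笑到邻居来敲门,生活再难也开心。"],
}

def build_prompt(video_summary, style, examples=STYLE_EXAMPLES):
    shots = "\n".join(f"- {e}" for e in examples.get(style, []))
    return (
        f"[风格标记:{style}]\n"
        f"视频内容摘要:{video_summary}\n"
        f"参考示例:\n{shots}\n"
        f"请按上述风格生成一条符合平台规范的评论。"
    )

print(build_prompt("小狗把主人的拖鞋叼进了水盆", "谐音梗"))
```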
论文及项目相关链接
摘要
短视频平台已成为现代互联网景观中的核心媒介,高效的信息传递和强大的交互性正在重塑用户参与和文化传播。评论在促进社区参与和内容再创作方面发挥着至关重要的作用。然而,生成既符合平台规范又具备风格多样性和上下文意识的评论仍然是一个重大挑战。我们推出了LOLGORITHM,这是一个为可控短视频评论生成设计的模块化多智能体系统(MAS)。该系统融合了视频分割、上下文和情感分析以及风格感知提示构建。它支持六种不同的评论风格:双关语(谐音)、押韵、模因应用、讽刺(反语)、普通幽默和内容提取。借助多模态大型语言模型(MLLM),LOLGORITHM直接处理视频输入,并通过明确的提示标记和少量示例实现精细风格控制。为了支持开发和评估,我们使用抖音(中文)和YouTube(英文)的官方API构建了一个双语数据集,涵盖了五种流行的视频类型:喜剧短片、日常生活笑话、有趣的动物剪辑、幽默评论和脱口秀。评估结合了自动化指标(原创性、相关性和风格一致性)以及涉及40个视频和105名参与者的大规模人类偏好研究。结果表明,LOLGORITHM显著优于基线模型,在抖音上的偏好率超过90%,在YouTube上达到87.55%。这项工作为短视频平台上的风格化评论生成提供了一个可扩展和文化适应的框架,为增强用户参与度和创造性互动提供了有希望的途径。
要点
- 短视频平台已成为现代互联网的核心,高效的信息传递和强大的交互性重塑了用户参与和文化传播。
- 评论在促进社区参与和内容再创作方面发挥关键作用,但生成符合平台规范的评论具有挑战性。
- LOLGORITHM是一个为可控短视频评论生成设计的模块化多智能体系统(MAS)。
- 系统支持六种评论风格,结合视频分割、上下文和情感分析以及风格感知提示构建技术。
- 使用多模态大型语言模型(MLLM)实现精细风格控制,能直接处理视频输入。
- 构建了一个双语数据集用于研究和评估,涵盖了五种流行的视频类型。
点此查看论文截图
SigmaCollab: An Application-Driven Dataset for Physically Situated Collaboration
Authors:Dan Bohus, Sean Andrist, Ann Paradiso, Nick Saw, Tim Schoonbeek, Maia Stiber
We introduce SigmaCollab, a dataset enabling research on physically situated human-AI collaboration. The dataset consists of a set of 85 sessions in which untrained participants were guided by a mixed-reality assistive AI agent in performing procedural tasks in the physical world. SigmaCollab includes a set of rich, multimodal data streams, such as the participant and system audio, egocentric camera views from the head-mounted device, depth maps, head, hand and gaze tracking information, as well as additional annotations performed post-hoc. While the dataset is relatively small in size (~ 14 hours), its application-driven and interactive nature brings to the fore novel research challenges for human-AI collaboration, and provides more realistic testing grounds for various AI models operating in this space. In future work, we plan to use the dataset to construct a set of benchmarks for physically situated collaboration in mixed-reality task assistive scenarios. SigmaCollab is available at https://github.com/microsoft/SigmaCollab.
我们介绍了SigmaCollab数据集,该数据集支持对物理环境中人类与人工智能合作的研究。数据集包含85个会话集,其中未经过训练的参与者在一个混合现实辅助人工智能代理的指导下,在物理世界中执行程序任务。SigmaCollab包含一组丰富的多模式数据流,如参与者和系统音频、头戴设备上的以自我为中心的相机视图、深度图、头部、手部以及注视追踪信息,以及事后进行的额外注释。虽然数据集大小相对较小(约14小时),但其应用驱动和交互性为针对人类与人工智能合作的研究带来了新的挑战,并为各种在此空间运行的AI模型提供了更现实的测试平台。未来,我们计划使用该数据集为混合现实任务辅助场景中的物理环境合作构建一套基准测试。SigmaCollab可在https://github.com/microsoft/SigmaCollab获取。
论文及项目相关链接
摘要
SigmaCollab数据集用于研究物理环境中的人类-人工智能协作。该数据集包含85个会话,其中未经训练的参与者在混合现实辅助AI代理的指导下完成物理世界的程序任务。SigmaCollab包含丰富的多模式数据流,如参与者和系统音频、来自头戴设备的个人视角相机视图、深度图、头部、手部以及注视追踪信息等,还包括事后进行的额外注释。虽然数据集大小相对较小(约14小时),但其以应用为导向和交互性为人类与人工智能的协作带来了新颖的研究挑战,并为这一领域的各种人工智能模型提供了更现实的测试环境。未来,我们计划使用该数据集为混合现实任务辅助场景中的物理协作构建一系列基准测试。SigmaCollab数据集可在https://github.com/microsoft/SigmaCollab获取。
要点
- SigmaCollab是一个用于研究物理环境中人类-AI协作的数据集。
- 数据集包含85个会话,涉及参与者在AI的指导下完成物理任务。
- 数据集包含丰富的多模式数据,包括音频、视频、深度图以及头部、手部跟踪信息等。
- 数据集虽然规模小,但为人工智能模型在混合现实任务中的真实表现提供了测试环境。
- 数据集为未来构建物理协作的基准测试提供了潜力。
- SigmaCollab数据集可用于研究人类与AI交互的新挑战和机会。
点此查看论文截图
Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech
Authors:Pedro Corrêa, João Lima, Victor Moreno, Lucas Ueda, Paula Dornhofer Paro Costa
Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models’ generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, indicating that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.
随着口语处理技术的进步,口语模型(SLMs)的发展旨在通过联合学习文本和音频表示来广泛实现各种任务的通用音频理解。虽然取得了令人鼓舞的结果,但对于这些模型的泛化能力以及它们内部表示中真正整合音频和文本模式的程度,正在出现越来越多的讨论。在这项工作中,我们使用情感不一致的语音样本数据集评估了四种SLM在语音情感识别任务上的表现。这是一种情况下,口头话语的语义内容传达了一种情感,而语音表现力传达了另一种情感。我们的结果表明,SLM主要依赖于文本语义而不是语音情感来完成任务,这表明文本相关表示在很大程度上占据了主导地位,超过了声学表示。我们向社区发布了代码和情绪不一致合成语音数据集(EMIS)。
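下面给出一个衡量“模型更依赖文本语义还是语音情感”的最小示意:在情感不一致样本上分别统计预测与文本情感标签、语音情感标签一致的比例;该指标定义为假设,并非论文的评测协议。

```python
# 假设性示意:在情感不一致样本上统计模型对文本/语音两种模态标签的跟随比例
def modality_reliance(preds, text_labels, speech_labels):
    n = len(preds)
    text_match = sum(p == t for p, t in zip(preds, text_labels)) / n
    speech_match = sum(p == s for p, s in zip(preds, speech_labels)) / n
    return {"follow_text": text_match, "follow_speech": speech_match}

preds = ["happy", "sad", "angry", "happy"]
text_labels = ["happy", "sad", "angry", "neutral"]
speech_labels = ["sad", "happy", "sad", "happy"]
print(modality_reliance(preds, text_labels, speech_labels))
# {'follow_text': 0.75, 'follow_speech': 0.25}
```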
论文及项目相关链接
PDF Submitted to IEEE ICASSP 2026. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses
Summary
随着语音识别技术的不断进步,口头语言模型(SLMs)得以发展,旨在通过联合学习文本和音频表示来实现通用音频理解,并广泛应用于各种任务。然而,关于这些模型的泛化能力和其在内部表示中真正融合音频和文本模态的程度,存在越来越多的讨论。本研究在情感不匹配的语音样本数据集上评估了四种SLM,在这种情境下,口头话语的语义内容传达了一种情感,而语音表达则传达了另一种情感。研究结果表明,SLM主要依赖于文本语义而非语音情感来完成任务,这意味着文本相关的表示在很大程度上优于声音表示。我们向社区发布了代码和情感不匹配的合成语音数据集(EMIS)。
Key Takeaways
- 口头语言模型(SLMs)通过联合学习文本和音频表示,实现了通用音频理解,并广泛应用于各种任务。
- SLMs在情感不匹配的语音样本数据集上的表现被评估。
- SLMs主要依赖于文本语义完成语音情感识别任务,而非语音情感。
- 文本相关的表示在很大程度上优于声音表示。
- SLMs的泛化能力是一个重要的问题,需要更多的研究。
- 音频和文本模态在SLMs内部表示中的融合程度有限。
点此查看论文截图
Leveraging Large Language Models to Identify Conversation Threads in Collaborative Learning
Authors:Prerna Ravi, Dong Won Lee, Beatriz Flamia, Jasmine David, Brandon Hanks, Cynthia Breazeal, Emma Anderson, Grace Lin
Understanding how ideas develop and flow in small-group conversations is critical for analyzing collaborative learning. A key structural feature of these interactions is threading, the way discourse talk naturally organizes into interwoven topical strands that evolve over time. While threading has been widely studied in asynchronous text settings, detecting threads in synchronous spoken dialogue remains challenging due to overlapping turns and implicit cues. At the same time, large language models (LLMs) show promise for automating discourse analysis but often struggle with long-context tasks that depend on tracing these conversational links. In this paper, we investigate whether explicit thread linkages can improve LLM-based coding of relational moves in group talk. We contribute a systematic guidebook for identifying threads in synchronous multi-party transcripts and benchmark different LLM prompting strategies for automated threading. We then test how threading influences performance on downstream coding of conversational analysis frameworks, that capture core collaborative actions such as agreeing, building, and eliciting. Our results show that providing clear conversational thread information improves LLM coding performance and underscores the heavy reliance of downstream analysis on well-structured dialogue. We also discuss practical trade-offs in time and cost, emphasizing where human-AI hybrid approaches can yield the best value. Together, this work advances methods for combining LLMs and robust conversational thread structures to make sense of complex, real-time group interactions.
理解小组对话中想法如何产生和流动对于分析协作学习至关重要。这些互动的一个关键结构特征是话题线索(threading),即话语自然组织成随时间演变、相互交织的主题线索。虽然线索识别在异步文本环境中已被广泛研究,但由于话轮重叠和线索隐含,在同步口语对话中检测线索仍然具有挑战性。与此同时,大型语言模型(LLM)在自动化话语分析方面显示出潜力,但在依赖追踪这些对话链接的长上下文任务上往往表现不佳。在本文中,我们研究显式的线索链接是否能改进基于LLM的小组对话关系行为编码。我们提供了一套在同步多方转写中识别线索的系统指南,并对不同的LLM自动线索标注提示策略进行了基准测试。随后,我们检验线索信息如何影响下游会话分析框架的编码性能,这些框架捕捉同意、建构和引出等核心协作行为。结果表明,提供清晰的对话线索信息能够提升LLM的编码性能,并凸显下游分析对结构良好对话的高度依赖。我们还讨论了时间与成本上的实际权衡,指出人机混合方法在哪些环节能产生最大价值。总体而言,这项工作推进了将LLM与稳健的对话线索结构相结合、以理解复杂实时群体互动的方法。
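下面是一个把带编号的同步转写交给大语言模型做线索标注的假设性提示构建草图;其中call_llm为占位函数,提示词格式亦为假设,并非论文附带的编码指南原文。

```python
# 假设性示意:构造"为每个话轮标注话题线索编号"的提示词
def build_threading_prompt(turns):
    numbered = "\n".join(f"[{i}] {speaker}: {text}" for i, (speaker, text) in enumerate(turns))
    return (
        "下面是一段多人协作讨论的转写,请为每个话轮标注话题线索编号,"
        "同一线索的话轮使用相同编号,输出格式为 '话轮编号 -> 线索编号':\n" + numbered
    )

turns = [("A", "我们先确定传感器的采样率吧"),
         ("B", "好,另外海报标题谁来写?"),
         ("A", "采样率用 100Hz 够吗")]
prompt = build_threading_prompt(turns)
# response = call_llm(prompt)   # 占位:此处可调用任意大语言模型
print(prompt)
```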
论文及项目相关链接
PDF In Submission: Journal of Educational Data Mining (jEDM) 2026
Summary
本文探讨了小群组对话中思想发展与流动的重要性,强调分析协作学习的关键。文章研究了对话中的“线程”结构,即话语自然组织成交织的主题线索,随时间发展演变。尽管线程在异步文本环境中已被广泛研究,但在同步对话中检测线程仍具挑战性。大型语言模型(LLMs)在自动话语分析方面显示出潜力,但在依赖追踪这些对话链接的长上下文任务方面往往表现挣扎。本文调查了明确线程链接是否能改进基于LLM的关系动作编码。我们为同步多方对话中线程的识别提供了系统指南,并对不同的LLM提示策略进行了基准测试。然后测试了线程对下游会话分析框架编码的影响,这些框架捕捉核心协作动作,如同意、构建和激发。结果表明,提供清晰的对话线程信息提高了LLM编码性能,并强调了下游分析对结构良好对话的依赖。
Key Takeaways
- 小组对话中的思想发展与流动对协作学习分析至关重要。
- 对话中的“线程”结构是分析重点,指话语自然交织的主题线索随时间发展演变。
- 同步对话中的线程检测具有挑战性,因存在重叠的发言和隐含的线索。
- 大型语言模型(LLMs)在自动话语分析方面有潜力,但在长上下文任务上表现可能受限。
- 明确线程链接能改进基于LLM的关系动作编码。
- 提供清晰的对话线程信息对提高LLM编码性能至关重要。
点此查看论文截图
Beyond touch-based HMI: Control your machines in natural language by utilizing large language models and OPC UA
Authors:Bernd Hofmann, Sven Kreitlein, Joerg Franke, Patrick Bruendl
This paper proposes an agent-based approach toward a more natural interface between humans and machines. Large language models equipped with tools and the communication standard OPC UA are utilized to control machines in natural language. Instead of touch interaction, which is currently the state-of-the-art medium for interaction in operations, the proposed approach enables operators to talk or text with machines. This allows commands such as ‘Please decrease the temperature by 20 % in machine 1 and set the motor speed to 5000 rpm in machine 2.’ The large language model receives the user input and selects one of three predefined tools that connect to an OPC UA server and either change or read the value of a node. Afterwards, the result of the tool execution is passed back to the language model, which then provides a final response to the user. The approach is universally designed and can therefore be applied to any machine that supports the OPC UA standard. The large language model is neither fine-tuned nor requires training data, only the relevant machine credentials and a parameter dictionary are included within the system prompt. The approach is evaluated on a Siemens S7-1500 programmable logic controller with four machine parameters in a case study of fifty synthetically generated commands on five different models. The results demonstrate high success rate, with proprietary GPT 5 models achieving accuracies between 96.0 % and 98.0 %, and open-weight models reaching up to 90.0 %. The proposed approach of this empirical study contributes to advancing natural interaction in industrial human-machine interfaces.
本文提出了一种基于代理的方法,以实现更自然的人机交互界面。该方法利用配备工具和OPC UA通信标准的大型语言模型,以自然语言控制机器。不同于当前操作场景中最主流的触摸交互方式,所提出的方法使操作者可以通过语音或文字与机器交流,例如下达指令:“请将机器1的温度降低20%,并将机器2的电机转速设置为5000转/分。”大型语言模型接收用户输入,并选择三种预定义工具之一连接到OPC UA服务器,以更改或读取节点的值。然后,工具执行的结果将反馈给语言模型,再由其向用户给出最终响应。该方法设计通用,可应用于任何支持OPC UA标准的机器。大型语言模型既不需要微调也不需要训练数据,只需在系统提示中包含相关的机器凭据和参数词典即可。该方法在西门子S7-1500可编程逻辑控制器上围绕四个机器参数进行评估,案例研究使用五十条合成生成的命令对五个不同模型进行测试。结果显示成功率很高,专有GPT 5模型的准确率在96.0%至98.0%之间,开放权重模型的准确率高达90.0%。本实证研究所提出的方法有助于推动工业人机界面中的自然交互发展。
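下面给出可供大语言模型调用的两个OPC UA工具函数的草图(读取、写入节点值),基于开源的python-opcua库;服务器地址与节点ID均为示例假设,实际使用时需替换为论文所述的机器凭据与参数词典中的值。

```python
# 假设性示意:供 LLM 作为工具调用的 OPC UA 节点读写函数(使用开源 python-opcua 库)
from opcua import Client

SERVER_URL = "opc.tcp://192.168.0.10:4840"   # 假设的 PLC 地址,需替换为实际机器凭据

def read_node(node_id):
    client = Client(SERVER_URL)
    client.connect()
    try:
        return client.get_node(node_id).get_value()
    finally:
        client.disconnect()

def write_node(node_id, value):
    client = Client(SERVER_URL)
    client.connect()
    try:
        client.get_node(node_id).set_value(value)
    finally:
        client.disconnect()

# 例如,语言模型可将"把 2 号机电机转速设为 5000 rpm"解析为:
# write_node("ns=3;i=1002", 5000)   # 节点 ID 为假设值
```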
论文及项目相关链接
Summary
本文提出一种基于代理的自然人机交互方式。该研究使用大型语言模型及OPC UA通信标准来控制机器的自然语言交互。此方式允许操作者通过对话或文本与机器沟通,替代当前主流的触摸交互方式。研究展示了如何使用语言模型选择工具与OPC UA服务器连接,改变或读取节点值,并反馈执行结果。该方式可广泛应用于支持OPC UA标准的任何机器。该语言模型无需微调,也不需训练数据,只需系统提示中输入相关机器凭证和参数词典即可。在一个使用Siemens S7-1500可编程逻辑控制器和五个不同模型的案例研究中,该方法的成功率很高,专有GPT 5模型的准确率在96.0%至98.0%之间,开放权重模型的准确率高达90.0%。此研究有助于推动工业人机界面中的自然交互发展。
Key Takeaways
- 该论文提出了一种基于代理的自然人机交互方式,使用大型语言模型和OPC UA通信标准来控制机器。
- 替代了当前主流的触摸交互方式,允许操作者通过对话或文本与机器进行沟通。
- 语言模型能够选择工具与OPC UA服务器连接,实现改变或读取节点值的功能,并反馈执行结果。
- 该方式具有通用性,可应用于任何支持OPC UA标准的机器。
- 语言模型无需微调,且不依赖训练数据,只需系统提示中输入相关机器凭证和参数词典即可操作。
- 在Siemens S7-1500可编程逻辑控制器上的案例研究表明,该方法成功率很高,专有GPT 5模型的准确率在96.0%至98.0%之间。
点此查看论文截图
Back to Ear: Perceptually Driven High Fidelity Music Reconstruction
Authors:Kangdi Wang, Zhiyue Wu, Dinghao Zhou, Rui Lin, Junyu Dai, Tao Jiang
Variational Autoencoders (VAEs) are essential for large-scale audio tasks like diffusion-based generation. However, existing open-source models often neglect auditory perceptual aspects during training, leading to weaknesses in phase accuracy and stereophonic spatial representation. To address these challenges, we propose εar-VAE, an open-source music signal reconstruction model that rethinks and optimizes the VAE training paradigm. Our contributions are threefold: (i) A K-weighting perceptual filter applied prior to loss calculation to align the objective with auditory perception. (ii) Two novel phase losses: a Correlation Loss for stereo coherence, and a Phase Loss using its derivatives–Instantaneous Frequency and Group Delay–for precision. (iii) A new spectral supervision paradigm where magnitude is supervised by all four Mid/Side/Left/Right components, while phase is supervised only by the LR components. Experiments show εar-VAE at 44.1kHz substantially outperforms leading open-source models across diverse metrics, showing particular strength in reconstructing high-frequency harmonics and the spatial characteristics.
变分自编码器(VAEs)对于基于扩散生成的大型音频任务至关重要。然而,现有的开源模型在训练过程中往往忽视了听觉感知方面,导致相位准确性和立体声空间表征的弱点。为了应对这些挑战,我们提出了εar-VAE,这是一个开源的音乐信号重建模型,它重新思考和优化了VAE训练范式。我们的贡献主要体现在三个方面:(i)在损失计算之前应用K-加权感知滤波器,以使目标与听觉感知相符。(ii)两种新型相位损失:用于立体声连贯性的相关性损失,以及使用其导数(即时频率和群延迟)的相位损失,以提高精度。(iii)一种新的频谱监督范式,其中幅度由四个中/侧/左/右组件进行监督,而相位仅由LR组件进行监督。实验表明,在44.1kHz下,εar-VAE在多种指标上的表现均优于领先的开源模型,特别是在重建高频谐波和空间特性方面表现出特别的优势。
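下面是按上述思路写的一个损失函数草图:幅度谱由Mid/Side/Left/Right四路共同监督,相位相关项只作用于左右声道;损失的具体形式(L1幅度差、相关系数差)均为假设,并非εar-VAE的原始定义。

```python
# 假设性示意:四路幅度监督 + 仅作用于左右声道的立体声相关性项
import torch

def stft_mag(x, n_fft=1024, hop=256):
    window = torch.hann_window(n_fft)
    return torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True).abs()

def reconstruction_loss(pred, target):
    # pred/target: [batch, 2, time] 的立体声波形(左、右)
    l_p, r_p = pred[:, 0], pred[:, 1]
    l_t, r_t = target[:, 0], target[:, 1]
    mid_p, side_p = (l_p + r_p) / 2, (l_p - r_p) / 2
    mid_t, side_t = (l_t + r_t) / 2, (l_t - r_t) / 2
    # 幅度:Mid/Side/Left/Right 四路分量全部监督
    mag = sum(torch.mean(torch.abs(stft_mag(a) - stft_mag(b)))
              for a, b in [(l_p, l_t), (r_p, r_t), (mid_p, mid_t), (side_p, side_t)])
    # 立体声一致性:左右声道的相关系数尽量接近目标
    def corr(a, b):
        a, b = a - a.mean(dim=-1, keepdim=True), b - b.mean(dim=-1, keepdim=True)
        return (a * b).mean(dim=-1) / (a.std(dim=-1) * b.std(dim=-1) + 1e-8)
    corr_loss = torch.mean(torch.abs(corr(l_p, r_p) - corr(l_t, r_t)))
    return mag + corr_loss

loss = reconstruction_loss(torch.randn(2, 2, 16384), torch.randn(2, 2, 16384))
print(loss.item())
```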
论文及项目相关链接
PDF Check the Code here: https://github.com/Eps-Acoustic-Revolution-Lab/EAR_VAE and Model Weights here: https://huggingface.co/earlab/EAR_VAE
Summary
εar-VAE模型针对大型音频任务中的变分自编码器(VAEs)进行改进和优化,以提升在音频任务中的性能。通过应用听觉感知过滤器和创新阶段损失策略,以及新的频谱监督模式,εar-VAE能更有效地处理音乐信号重建问题,尤其在重建高频和谐音和空间特性方面表现优越。该模型对现有开源模型进行了实质性改进。
Key Takeaways
- εar-VAE是一个针对大型音频任务的变分自编码器(VAEs)改进模型。
- 模型通过应用K-weighting感知过滤器来对齐目标听觉感知。
- 模型引入两种新的相位损失:用于立体声连贯性的相关性损失,以及利用相位导数(瞬时频率和群延迟)提升精度的相位损失。
- 模型采用新的频谱监督模式:幅度由Mid/Side/Left/Right四路分量共同监督,而相位仅由左右声道分量监督。
点此查看论文截图
FSR-VLN: Fast and Slow Reasoning for Vision-Language Navigation with Hierarchical Multi-modal Scene Graph
Authors:Xiaolin Zhou, Tingyang Xiao, Liu Liu, Yucheng Wang, Maiyue Chen, Xinrui Meng, Xinjie Wang, Wei Feng, Wei Sui, Zhizhong Su
Visual-Language Navigation (VLN) is a fundamental challenge in robotic systems, with broad applications for the deployment of embodied agents in real-world environments. Despite recent advances, existing approaches are limited in long-range spatial reasoning, often exhibiting low success rates and high inference latency, particularly in long-range navigation tasks. To address these limitations, we propose FSR-VLN, a vision-language navigation system that combines a Hierarchical Multi-modal Scene Graph (HMSG) with Fast-to-Slow Navigation Reasoning (FSR). The HMSG provides a multi-modal map representation supporting progressive retrieval, from coarse room-level localization to fine-grained goal view and object identification. Building on HMSG, FSR first performs fast matching to efficiently select candidate rooms, views, and objects, then applies VLM-driven refinement for final goal selection. We evaluated FSR-VLN across four comprehensive indoor datasets collected by humanoid robots, utilizing 87 instructions that encompass a diverse range of object categories. FSR-VLN achieves state-of-the-art (SOTA) performance in all datasets, measured by the retrieval success rate (RSR), while reducing the response time by 82% compared to VLM-based methods on tour videos by activating slow reasoning only when fast intuition fails. Furthermore, we integrate FSR-VLN with speech interaction, planning, and control modules on a Unitree-G1 humanoid robot, enabling natural language interaction and real-time navigation.
视觉语言导航(VLN)是机器人系统中的一个基本挑战,具有在现实世界的环境中部署实体代理的广泛应用。尽管最近有进展,但现有方法在远程空间推理方面仍存在局限,通常表现出成功率低和推理延迟高的问题,特别是在远程导航任务中。为了克服这些局限性,我们提出了FSR-VLN,一种结合分层多模式场景图(HMSG)和快慢导航推理(FSR)的视觉语言导航系统。HMSG提供了一种多模式地图表示,支持从粗糙的房间级定位到精细的目标视图和对象识别的渐进检索。基于HMSG,FSR首先进行快速匹配,以有效地选择候选房间、视图和对象,然后应用VLM驱动的细化进行最终目标选择。我们在由人形机器人收集的四个综合室内数据集上评估了FSR-VLN,利用87条指令涵盖了各种对象类别。FSR-VLN在所有数据集上的检索成功率(RSR)均达到最新水平,同时在激活仅在快速直觉失败时才启动的慢速推理时,与基于VLM的方法相比,将响应时间减少了82%。此外,我们将FSR-VLN与Unitree-G1人形机器人的语音交互、规划和控制模块集成,实现自然语言交互和实时导航。
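下面用一个假设性草图示意“快匹配失败才回退到慢速VLM推理”的调度逻辑;embed与vlm_refine均为占位函数,置信度阈值0.8亦为假设值。

```python
# 假设性示意:快匹配(嵌入相似度)置信度不足时才回退到慢速 VLM 推理
import numpy as np

def embed(text):
    # 占位:返回确定性的伪嵌入,真实场景应使用多模态编码器
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(16)
    return v / np.linalg.norm(v)

def vlm_refine(instruction, candidates):
    # 占位:真实场景中调用视觉语言模型做精细选择
    return candidates[0]

def fast_slow_select(instruction, candidates, threshold=0.8):
    q = embed(instruction)
    sims = [float(embed(c) @ q) for c in candidates]
    best = int(np.argmax(sims))
    if sims[best] >= threshold:                  # 快速直觉足够可信
        return candidates[best], "fast"
    return vlm_refine(instruction, candidates), "slow"   # 否则回退到慢速推理

print(fast_slow_select("去厨房拿红色水杯", ["厨房-台面视角", "客厅-沙发视角"]))
```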
论文及项目相关链接
PDF 8 pages
Summary:
VLN(视觉语言导航)是机器人系统的一项基本挑战,广泛应用于真实世界环境中的智能代理部署。为解决现有方法在长距离空间推理上的局限性,我们提出了FSR-VLN系统,它结合了分层多模式场景图(HMSG)和快慢导航推理(FSR)。HMSG提供了多模式地图表示,支持从粗略的房间级定位到精细的目标视图和对象识别的渐进检索。FSR首先进行快速匹配,以高效选择候选房间、视图和对象,然后应用VLM驱动的细化进行最终目标选择。在由人形机器人收集的四个综合室内数据集上评估FSR-VLN,使用87条指令涵盖各种对象类别。FSR-VLN在所有数据集上的检索成功率(RSR)达到最新水平,并且由于仅在快速直觉失败时才启用慢速推理,相比基于VLM的方法将巡览视频上的响应时间减少了82%。此外,我们将FSR-VLN与语音交互、规划和控制模块集成在Unitree-G1人形机器人上,实现自然语言交互和实时导航。
Key Takeaways:
- VLN在机器人系统中具有广泛应用,但现有方法存在长距离空间推理的局限性。
- 提出了一种新的视觉语言导航系统FSR-VLN,结合了分层多模式场景图(HMSG)和快慢导航推理(FSR)。
- HMSG提供了多模式地图表示,支持从粗略到精细的渐进检索。
- FSR通过快速匹配选择候选,然后应用VLM细化目标选择。
- 在四个室内数据集上评估,FSR-VLN达到最新性能水平,显著提高检索成功率(RSR),并大大减少响应时间。
- FSR-VLN与人形机器人集成,实现自然语言交互和实时导航。