⚠️ All summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never use these summaries in serious academic settings; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-17
MOSPA: Human Motion Generation Driven by Spatial Audio
Authors:Shuyang Xu, Zhiyang Dou, Mingyi Shi, Liang Pan, Leo Ho, Jingbo Wang, Yuan Liu, Cheng Lin, Yuexin Ma, Wenping Wang, Taku Komura
Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generate human motion. As of yet, these models typically overlook the impact of spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA can generate diverse, realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task. Our code and model are publicly available at https://github.com/xsy27/Mospa-Acoustic-driven-Motion-Generation
Paper and Project Links
PDF NeurIPS 2025 (Spotlight)
Summary
This work studies spatial-audio-driven virtual human motion. Because suitable datasets and models have been lacking, research on this task has remained preliminary. The authors introduce the first Spatial Audio-Driven Human Motion (SAM) dataset, aimed at improving how virtual characters respond to spatial audio, and develop MOSPA, a diffusion-based generative framework that synthesizes realistic human motion conditioned on spatial audio. The work substantially improves the motion expressiveness of virtual characters. The dataset and model are publicly available.
Key Takeaways
- Builds the first Spatial Audio-Driven Human Motion (SAM) dataset, containing diverse, high-quality spatial audio and motion data.
- Proposes MOSPA, a diffusion-based generative framework that generates human motion from spatial audio and effectively captures the relationship between spatial audio and body motion.
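The abstract describes MOSPA only at a high level; as a rough illustration of what a diffusion-based, spatial-audio-conditioned motion generator can look like, the sketch below implements a generic conditional denoiser in PyTorch. All module names, feature dimensions, and the simple additive fusion are assumptions for illustration, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class SpatialAudioMotionDenoiser(nn.Module):
    """Toy conditional denoiser: predicts the noise added to a motion
    sequence, conditioned on spatial-audio features and a diffusion step.
    Dimensions and the fusion strategy are illustrative assumptions."""

    def __init__(self, motion_dim=69, audio_dim=128, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.motion_proj = nn.Linear(motion_dim, hidden)
        self.step_embed = nn.Embedding(1000, hidden)          # diffusion timestep embedding
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.head = nn.Linear(hidden, motion_dim)

    def forward(self, noisy_motion, audio_feat, t):
        # noisy_motion: (B, T, motion_dim), audio_feat: (B, T, audio_dim), t: (B,)
        h = self.motion_proj(noisy_motion) + self.audio_proj(audio_feat)
        h = h + self.step_embed(t).unsqueeze(1)               # broadcast step over time
        return self.head(self.backbone(h))

# Smoke test with random tensors standing in for real SAM data.
model = SpatialAudioMotionDenoiser()
x = torch.randn(2, 120, 69)        # 2 clips, 120 frames, 69-D pose vectors (assumed layout)
a = torch.randn(2, 120, 128)       # per-frame spatial-audio features (assumed)
t = torch.randint(0, 1000, (2,))
print(model(x, a, t).shape)        # torch.Size([2, 120, 69])
```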
DARAS: Dynamic Audio-Room Acoustic Synthesis for Blind Room Impulse Response Estimation
Authors:Chunxi Wang, Maoshen Jia, Wenyu Jin
Room Impulse Responses (RIRs) accurately characterize acoustic properties of indoor environments and play a crucial role in applications such as speech enhancement, speech recognition, and audio rendering in augmented reality (AR) and virtual reality (VR). Existing blind estimation methods struggle to achieve practical accuracy. To overcome this challenge, we propose the dynamic audio-room acoustic synthesis (DARAS) model, a novel deep learning framework that is explicitly designed for blind RIR estimation from monaural reverberant speech signals. First, a dedicated deep audio encoder effectively extracts relevant nonlinear latent space features. Second, the Mamba-based self-supervised blind room parameter estimation (MASS-BRPE) module, utilizing the efficient Mamba state space model (SSM), accurately estimates key room acoustic parameters and features. Third, the system incorporates a hybrid-path cross-attention feature fusion module, enhancing deep integration between audio and room acoustic features. Finally, our proposed dynamic acoustic tuning (DAT) decoder adaptively segments early reflections and late reverberation to improve the realism of synthesized RIRs. Experimental results, including a MUSHRA-based subjective listening study, demonstrate that DARAS substantially outperforms existing baseline models, providing a robust and effective solution for practical blind RIR estimation in real-world acoustic environments.
Paper and Project Links
PDF 14 pages, 9 figures, accepted for publication in IEEE/ACM Transactions on Audio, Speech, and Language Processing
Summary
This paper proposes the dynamic audio-room acoustic synthesis (DARAS) model, a deep learning framework designed specifically for blind room impulse response (RIR) estimation from monaural reverberant speech. The model extracts nonlinear latent features with a deep audio encoder, performs self-supervised blind estimation of room acoustic parameters with a Mamba-based state space model (SSM), fuses audio and room features with a hybrid-path cross-attention module, and adaptively segments early reflections and late reverberation to improve the realism of synthesized RIRs. Experiments show that DARAS clearly outperforms baseline models in real-world acoustic environments.
Key Takeaways
- Room impulse responses (RIRs) accurately characterize indoor acoustics and are key to speech enhancement, speech recognition, and audio rendering in AR/VR.
- Existing blind estimation methods fall short of practical accuracy; DARAS is proposed to close this gap.
- DARAS includes a deep audio encoder for nonlinear latent features and a Mamba-based state space model for self-supervised blind estimation of room acoustic parameters.
- A hybrid-path cross-attention feature fusion module deepens the integration of audio and room acoustic features.
- A dynamic acoustic tuning (DAT) decoder adaptively segments early reflections and late reverberation, improving the realism of synthesized RIRs.
- Experiments show DARAS substantially outperforms existing baselines on blind RIR estimation.
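The DAT decoder's early/late split is described only conceptually; the snippet below shows, under assumptions, the generic signal-processing convention behind it: split an RIR at a fixed boundary (50 ms is a common but assumed choice) into early reflections and late reverberation, then render reverberant speech by convolution. This is not the DARAS decoder itself.

```python
import numpy as np
from scipy.signal import fftconvolve

def split_rir(rir, sr, boundary_ms=50.0):
    """Split an RIR into early reflections and late reverberation at a
    fixed boundary (50 ms is an assumed, commonly used convention)."""
    cut = int(sr * boundary_ms / 1000.0)
    return rir[:cut], rir[cut:]

def render_reverberant(dry_speech, rir):
    """Render reverberant speech by convolving dry speech with an RIR."""
    wet = fftconvolve(dry_speech, rir)[: len(dry_speech)]
    return wet / (np.max(np.abs(wet)) + 1e-8)   # simple peak normalization

# Toy example: a synthetic exponentially decaying RIR at 16 kHz.
sr = 16000
rir = np.random.randn(sr // 2) * np.exp(-np.linspace(0, 8, sr // 2))
early, late = split_rir(rir, sr)
speech = np.random.randn(sr)                    # stand-in for 1 s of dry speech
print(len(early), len(late), render_reverberant(speech, rir).shape)
```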
DeepOmni: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE
Authors:Hang Shao, Heting Gao, Yunhang Shen, Jiawei Chen, Zuwei Long, Dong Yang, Ke Li, Xing Sun
Native multimodal large language models (MLLMs) restructure a single large language model (LLM) into a spoken language model (SLM) capable of both speech and text generation. Compared to modular and aligned MLLMs, native MLLMs preserve richer paralinguistic features such as emotion and prosody, and generate speech responses directly within the backbone LLM rather than using a separate speech decoder. This integration also results in lower response latency and smoother interaction. However, native MLLMs suffer from catastrophic forgetting and performance degradation because the available paired speech-text data is insufficient to support the pretraining of MLLMs compared to the vast amount of text data required to pretrain text LLMs. To address this issue, we propose DeepTalk, a framework for adaptive modality expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk first adaptively distinguishes modality experts according to their modality load within the LLM. Each modality expert then undergoes specialized single-modality training, followed by joint multimodal collaborative training. As a result, DeepTalk incurs only a 5.5% performance drop compared to the original LLM, which is significantly lower than the average performance drop of over 20% typically seen in native MLLMs (such as GLM-4-Voice), and is on par with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within 0.5 seconds, ensuring a seamless and intelligent speech interaction experience. Code and models are released at https://github.com/talkking/DeepTalk.
Paper and Project Links
PDF Under Review
Summary
Native multimodal large language models (MLLMs) restructure a single LLM into a spoken language model that generates both speech and text. Compared with modular and aligned MLLMs, they preserve richer paralinguistic features such as emotion and prosody and generate speech directly within the backbone LLM rather than through a separate decoder, but they suffer from catastrophic forgetting and performance degradation because paired speech-text data is far scarcer than the text used to pretrain text LLMs. To address this, the authors propose DeepTalk, an adaptive modality-expert learning framework built on a Mixture-of-Experts (MoE) architecture: modality experts are first identified adaptively by their modality load inside the LLM, then trained on their single modality, and finally trained jointly across modalities. DeepTalk loses only 5.5% performance relative to the original LLM, far below the >20% average drop typical of native MLLMs, and stays on par with modular MLLMs, while keeping end-to-end dialogue latency within 0.5 seconds. Code and models are released at https://github.com/talkking/DeepTalk.
Key Takeaways
- Native MLLMs combine speech and text generation and preserve rich paralinguistic features.
- Unlike modular and aligned MLLMs, native MLLMs generate speech responses directly within the backbone LLM.
- Native MLLMs suffer from catastrophic forgetting and performance degradation caused by insufficient paired speech-text data.
- The DeepTalk framework addresses these issues through adaptive modality-expert learning.
- DeepTalk identifies modality experts by their modality load, then applies single-modality training followed by joint multimodal training.
- DeepTalk shows only a small performance drop relative to the original LLM, outperforming other native MLLMs and matching modular MLLMs.
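The abstract does not spell out how "modality load" is measured or how experts are assigned; the sketch below illustrates one plausible reading in PyTorch: a token-level router over a small expert pool, with a per-modality load counter that could be used to tag experts as speech or text specialists. Names, routing rule, and counter logic are assumptions, not DeepTalk's actual mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAwareMoE(nn.Module):
    """Toy top-1 MoE layer plus a per-modality load counter (assumed logic)
    that tracks how often each modality's tokens are routed to each expert."""

    def __init__(self, d_model=512, n_experts=8, n_modalities=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(n_experts)])
        # load[m, e] counts how often modality m routed to expert e.
        self.register_buffer("load", torch.zeros(n_modalities, n_experts))

    def forward(self, x, modality_id):
        # x: (B, T, d_model); modality_id: (B, T) with 0=text, 1=speech (assumed convention).
        probs = F.softmax(self.router(x), dim=-1)          # (B, T, E)
        top1 = probs.argmax(dim=-1)                        # (B, T)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                out[mask] = expert(x[mask])
                for m in range(self.load.shape[0]):
                    self.load[m, e] += (mask & (modality_id == m)).sum()
        return out

layer = ModalityAwareMoE()
x = torch.randn(2, 16, 512)
mods = torch.randint(0, 2, (2, 16))
print(layer(x, mods).shape, layer.load.sum().item())       # (2, 16, 512), 32 routed tokens
```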
TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data
Authors:Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari
This paper presents TTSOps, a fully automated closed-loop framework for constructing multi-speaker text-to-speech (TTS) systems from noisy, uncurated web-scale speech data, often referred to as "dark data," such as online videos. Conventional TTS training pipelines require well-curated corpora with high acoustic quality and accurate text-speech alignment, which severely limits scalability, speaker diversity, and real-world applicability. While recent studies have proposed acoustic-quality-based data selection techniques, they often overlook two critical aspects: (1) the inherent robustness of modern TTS models to noise, and (2) the potential contribution of perceptually low-quality yet informative samples. To address these issues, TTSOps introduces a data-centric training pipeline that integrates three core components: (1) automated data collection from dark data sources, (2) utterance-level dynamic selection of data cleansing methods based on training data quality, and (3) evaluation-in-the-loop data selection using automatically predicted mean opinion scores (MOS) to estimate each utterance's impact on model performance. Furthermore, TTSOps jointly optimizes the corpus and the TTS model in a closed-loop framework by dynamically adapting both data selection and data cleansing processes to the characteristics of the target TTS model. Extensive experiments on Japanese YouTube data demonstrate that TTSOps outperforms conventional acoustic-quality-based baselines in both the naturalness and speaker diversity of synthesized speech.
Paper and Project Links
PDF Accepted to IEEE Transactions on Audio, Speech and Language Processing
Summary
This paper presents TTSOps, a fully automated closed-loop framework for building multi-speaker text-to-speech (TTS) systems from noisy, uncurated web-scale speech data. Unlike conventional TTS pipelines, which require large, well-curated corpora with accurate text-speech alignment, TTSOps targets the resulting limits on scalability, speaker diversity, and real-world applicability. It introduces a data-centric training pipeline with automated data collection, utterance-level dynamic selection of data-cleansing methods based on training-data quality, and evaluation-in-the-loop data selection using automatically predicted mean opinion scores (MOS). TTSOps adapts data selection and cleansing to the characteristics of the target TTS model, and experiments on Japanese YouTube data show it outperforms conventional acoustic-quality-based baselines in both naturalness and speaker diversity.
Key Takeaways
- TTSOps is a fully automated closed-loop framework for building multi-speaker TTS systems.
- It addresses the limits of conventional TTS training pipelines in scalability, speaker diversity, and real-world applicability.
- TTSOps builds TTS systems from so-called "dark data" such as online videos.
- It integrates three core components: automated data collection, utterance-level dynamic data cleansing, and evaluation-in-the-loop data selection.
- TTSOps uses automatically predicted mean opinion scores (MOS) to estimate each utterance's impact on model performance.
- It dynamically adapts data selection and cleansing to the characteristics of the target TTS model.
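The evaluation-in-the-loop selection is only described in prose; the sketch below shows, under assumptions, the general shape of such a loop: score each candidate utterance with a pluggable MOS predictor, keep the utterances whose predicted score clears a threshold, and hand the surviving corpus to a training routine. `predict_mos` and `train_tts` are hypothetical stand-ins, not TTSOps APIs.

```python
import random
from typing import Callable, Dict, List

def closed_loop_corpus_selection(
    utterances: List[Dict],
    predict_mos: Callable[[Dict], float],    # hypothetical MOS predictor
    train_tts: Callable[[List[Dict]], None], # hypothetical training routine
    mos_threshold: float = 3.5,
    rounds: int = 3,
) -> List[Dict]:
    """Assumed evaluation-in-the-loop flavor: in each round, re-score the
    pool with the current MOS predictor, keep utterances above the
    threshold, and retrain on the kept subset."""
    corpus = list(utterances)
    for r in range(rounds):
        scored = [(u, predict_mos(u)) for u in corpus]
        corpus = [u for u, mos in scored if mos >= mos_threshold]
        print(f"round {r}: kept {len(corpus)} / {len(scored)} utterances")
        train_tts(corpus)
    return corpus

# Toy run with random scores standing in for a learned MOS predictor.
pool = [{"id": i, "wav": f"clip_{i}.wav"} for i in range(100)]
final = closed_loop_corpus_selection(
    pool,
    predict_mos=lambda u: random.uniform(1.0, 5.0),
    train_tts=lambda corpus: None,           # no-op trainer for the example
)
print(len(final), "utterances selected")
```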
UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation
Authors:Jinting Wang, Shan Yang, Chenxing Li, Dong Yu, Li Liu
Cued Speech (CS) enhances lipreading via hand coding, offering visual phonemic cues that support precise speech perception for the hearing-impaired. The task of CS Video-to-Speech generation (CSV2S) aims to convert CS videos into intelligible speech signals. Most existing research focuses on CS Recognition (CSR), which transcribes video content into text. Consequently, a common solution for CSV2S is to integrate CSR with a text-to-speech (TTS) system. However, this pipeline relies on text as an intermediate medium, which may lead to error propagation and temporal misalignment between speech and CS video dynamics. In contrast, directly generating audio speech from CS video (direct CSV2S) often suffers from the inherent multimodal complexity and the limited availability of CS data. To address these challenges, we propose UniCUE, the first unified framework for CSV2S that directly generates speech from CS videos without relying on intermediate text. The core innovation of UniCUE lies in integrating an understanding task (CSR) that provides fine-grained CS visual-semantic cues to guide speech generation. Specifically, UniCUE incorporates a pose-aware visual processor, a semantic alignment pool that enables precise visual-semantic mapping, and a VisioPhonetic adapter to bridge the understanding and generation tasks within a unified architecture. To support this framework, we construct UniCUE-HI, a large-scale Mandarin CS dataset containing 11282 videos from 14 cuers, including both hearing-impaired and normal-hearing individuals. Extensive experiments on this dataset demonstrate that UniCUE achieves state-of-the-art performance across multiple evaluation metrics.
Paper and Project Links
PDF 13 pages, 12 figures
Summary
Cued Speech (CS) augments lipreading with hand coding, providing visual phonemic cues that let hearing-impaired people perceive speech precisely. The CS Video-to-Speech generation (CSV2S) task converts CS videos into intelligible speech. Most existing work focuses on CS recognition (CSR), which transcribes video content into text, so the usual CSV2S solution chains CSR with a text-to-speech (TTS) system; relying on text as an intermediate medium, however, can cause error propagation and temporal misalignment between speech and the CS video dynamics, while direct CSV2S struggles with inherent multimodal complexity and limited CS data. To tackle these challenges, the authors propose UniCUE, the first unified CSV2S framework that generates speech directly from CS video without intermediate text. Its core innovation is integrating the CSR understanding task to supply fine-grained visual-semantic cues that guide speech generation, combining a pose-aware visual processor, a semantic alignment pool for precise visual-semantic mapping, and a VisioPhonetic adapter that bridges understanding and generation within one architecture. To support the framework, the authors build UniCUE-HI, a large-scale Mandarin CS dataset of 11282 videos from 14 cuers, including both hearing-impaired and normal-hearing individuals; extensive experiments on it show UniCUE achieves state-of-the-art performance across multiple metrics.
Key Takeaways
- Cued Speech (CS) augments lipreading with hand coding, providing visual phonemic cues for hearing-impaired people.
- The CSV2S task converts CS videos into intelligible speech and faces multimodal complexity and limited data.
- Existing methods mostly rely on text as an intermediate medium, which can cause error propagation and temporal misalignment.
- UniCUE is the first unified framework that generates speech directly without intermediate text.
- UniCUE integrates the CSR task to provide fine-grained CS visual-semantic cues that guide speech generation.
- UniCUE combines several components, including a pose-aware visual processor, a semantic alignment pool, and a VisioPhonetic adapter.
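The component names above come from the abstract, but their interfaces are not given; the sketch below wires up placeholder modules in PyTorch purely to show how a pose-aware visual stream, an alignment pool, and an adapter could feed a speech decoder in one graph. Every module body and dimension here is an assumption, not UniCUE's implementation.

```python
import torch
import torch.nn as nn

class UniCUELikePipeline(nn.Module):
    """Placeholder composition: video features -> pose-aware visual processor ->
    semantic alignment pool -> VisioPhonetic-style adapter -> mel decoder.
    All internals are illustrative assumptions."""

    def __init__(self, video_dim=512, pose_dim=34, d_model=256, n_mels=80):
        super().__init__()
        self.visual_proc = nn.Linear(video_dim + pose_dim, d_model)  # pose-aware fusion (assumed: concat)
        self.align_pool = nn.MultiheadAttention(d_model, num_heads=4,
                                                batch_first=True)    # attention pool as alignment stand-in
        self.queries = nn.Parameter(torch.randn(1, 64, d_model))     # 64 learned semantic slots (assumed)
        self.adapter = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                     nn.Linear(d_model, d_model))
        self.mel_decoder = nn.GRU(d_model, n_mels, batch_first=True)

    def forward(self, video_feat, pose_feat):
        # video_feat: (B, T, video_dim), pose_feat: (B, T, pose_dim)
        h = self.visual_proc(torch.cat([video_feat, pose_feat], dim=-1))
        q = self.queries.expand(h.size(0), -1, -1)
        pooled, _ = self.align_pool(q, h, h)        # queries attend over visual features
        mel, _ = self.mel_decoder(self.adapter(pooled))
        return mel                                  # (B, 64, n_mels) coarse mel-like frames

model = UniCUELikePipeline()
v = torch.randn(2, 100, 512)    # 100 frames of pooled visual features (assumed)
p = torch.randn(2, 100, 34)     # flattened 2D keypoints per frame (assumed)
print(model(v, p).shape)        # torch.Size([2, 64, 80])
```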
Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation
Authors:Pengchao Feng, Ziyang Ma, Wenxi Chen, Yao Li, Sheng Wang, Kai Yu, Xie Chen
End-to-end speech-to-speech (S2S) dialogue systems have recently garnered increasing research attention for their lower latency and more natural integration of nonverbal cues such as emotion and speaker identity. However, these systems face key challenges, particularly in incorporating external knowledge, a capability commonly addressed by Retrieval-Augmented Generation (RAG) in text-based large language models (LLMs). The core difficulty lies in the modality gap between input speech and retrieved textual knowledge, which hinders effective integration of information. To address this issue, we propose a novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries. Experimental results demonstrate that our method significantly improves the performance of end-to-end S2S dialogue systems while achieving higher retrieval efficiency. Although the overall performance still lags behind the SOTA cascaded models, our framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems. Our code and dataset are released.
Paper and Project Links
PDF Accepted to EMNLP 2025 Findings
Summary
This paper examines end-to-end speech-to-speech (S2S) dialogue systems and their key challenge of integrating external knowledge. The authors propose a novel end-to-end retrieval-augmented generation (RAG) framework that retrieves relevant textual knowledge directly from speech queries. Experiments show the method significantly improves end-to-end S2S dialogue performance while achieving higher retrieval efficiency, offering a promising direction for strengthening knowledge integration in end-to-end S2S systems.
Key Takeaways
- End-to-end speech-to-speech (S2S) dialogue systems are attracting research attention for their low latency and more natural integration of nonverbal cues.
- A key challenge is incorporating external knowledge, which text-based large language models (LLMs) usually address with retrieval-augmented generation (RAG).
- The modality gap between input speech and retrieved textual knowledge is the main obstacle to effective information integration.
- The paper proposes a novel end-to-end RAG framework that retrieves relevant textual knowledge directly from speech queries.
- Experiments show the method improves both the performance and the retrieval efficiency of S2S dialogue systems.
- Although overall performance still trails cascaded models, the framework is a promising direction for knowledge integration in end-to-end S2S systems.
- The code and dataset are publicly released.
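Retrieving text directly from a speech query amounts to embedding both modalities into a shared space and ranking by similarity; the sketch below shows that generic pattern with placeholder encoders. The encoders, dimensions, and cosine-similarity ranking are assumptions; the paper's actual retriever may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalRetriever(nn.Module):
    """Toy dual-encoder retriever: a speech encoder and a text encoder map
    into a shared space; documents are ranked by cosine similarity.
    Both encoders are random placeholders, not trained models."""

    def __init__(self, speech_dim=80, text_dim=300, shared_dim=256):
        super().__init__()
        self.speech_enc = nn.GRU(speech_dim, shared_dim, batch_first=True)
        self.text_enc = nn.Linear(text_dim, shared_dim)

    def embed_speech(self, mel):                     # mel: (B, T, speech_dim)
        _, h = self.speech_enc(mel)                  # final hidden state as utterance embedding
        return F.normalize(h[-1], dim=-1)            # (B, shared_dim)

    def embed_text(self, doc_feat):                  # doc_feat: (N, text_dim)
        return F.normalize(self.text_enc(doc_feat), dim=-1)

    def retrieve(self, mel, doc_feat, k=3):
        sims = self.embed_speech(mel) @ self.embed_text(doc_feat).T   # (B, N)
        return sims.topk(k, dim=-1).indices          # top-k document indices per query

retriever = CrossModalRetriever()
query_mel = torch.randn(2, 200, 80)     # 2 spoken queries as mel features
doc_feats = torch.randn(50, 300)        # 50 knowledge-base documents (placeholder features)
print(retriever.retrieve(query_mel, doc_feats).shape)   # torch.Size([2, 3])
```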
Efficient Long-duration Talking Video Synthesis with Linear Diffusion Transformer under Multimodal Guidance
Authors:Haojie Zhang, Zhihao Liang, Ruibo Fu, Bingyan Liu, Zhengqi Wen, Xuefei Liu, Jianhua Tao, Yaling Liang
Long-duration talking video synthesis faces enduring challenges in achieving high video quality, portrait and temporal consistency, and computational efficiency. As video length increases, issues such as visual degradation, identity inconsistency, temporal incoherence, and error accumulation become increasingly problematic, severely affecting the realism and reliability of the results. To address these challenges, we present LetsTalk, a diffusion transformer framework equipped with multimodal guidance and a novel memory bank mechanism, explicitly maintaining contextual continuity and enabling robust, high-quality, and efficient generation of long-duration talking videos. In particular, LetsTalk introduces a noise-regularized memory bank to alleviate error accumulation and sampling artifacts during extended video generation. To further improve efficiency and spatiotemporal consistency, LetsTalk employs a deep compression autoencoder and a spatiotemporal-aware transformer with linear attention for effective multimodal fusion. We systematically analyze three fusion schemes and show that combining deep (Symbiotic Fusion) for portrait features and shallow (Direct Fusion) for audio achieves superior visual realism and precise speech-driven motion, while preserving diversity of movements. Extensive experiments demonstrate that LetsTalk establishes new state-of-the-art in generation quality, producing temporally coherent and realistic talking videos with enhanced diversity and liveliness, and maintains remarkable efficiency with 8x fewer parameters than previous approaches.
Paper and Project Links
PDF 10 pages, 7 figures
Summary
Long-duration talking-video synthesis must simultaneously deliver high video quality, portrait and temporal consistency, and computational efficiency. As video length grows, visual degradation, identity inconsistency, temporal incoherence, and error accumulation become increasingly severe and undermine the realism and reliability of the results. To address this, the authors propose LetsTalk, a diffusion transformer framework with multimodal guidance and a novel memory bank mechanism that explicitly maintains contextual continuity for robust, high-quality, and efficient long-duration generation. In particular, LetsTalk introduces a noise-regularized memory bank to mitigate error accumulation and sampling artifacts during extended generation, and pairs a deep compression autoencoder with a spatiotemporal-aware transformer using linear attention for effective multimodal fusion. A systematic analysis of three fusion schemes shows that deep Symbiotic Fusion for portrait features combined with shallow Direct Fusion for audio gives the best visual realism and precise speech-driven motion while preserving movement diversity. Extensive experiments show that LetsTalk sets a new state of the art in generation quality, producing temporally coherent, realistic talking videos with greater diversity and liveliness, while remaining remarkably efficient with 8x fewer parameters than previous approaches.
Key Takeaways
- Long-duration talking-video synthesis faces three challenges: video quality, portrait and temporal consistency, and computational efficiency.
- As video length increases, visual degradation and identity inconsistency become increasingly severe.
- LetsTalk addresses these challenges with a diffusion transformer framework equipped with multimodal guidance and a memory bank mechanism.
- A noise-regularized memory bank mitigates error accumulation and sampling artifacts.
- Combining deep fusion for portrait features with shallow fusion for audio yields the best visual realism and speech-driven motion.
- LetsTalk sets a new bar for generation quality, producing temporally coherent, realistic, diverse, and lively videos.
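The deep-vs-shallow fusion distinction is described only qualitatively; the sketch below contrasts, under assumptions, a "symbiotic"-style fusion that injects portrait features at every transformer block against a "direct"-style fusion that concatenates audio features once at the input. Module names and dimensions are illustrative, not LetsTalk's implementation.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """One transformer-style block with optional per-block (deep) conditioning."""

    def __init__(self, d_model=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.cond_proj = nn.Linear(d_model, d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x, deep_cond=None):
        if deep_cond is not None:                 # "symbiotic"-style injection at every block (assumed)
            x = x + self.cond_proj(deep_cond)
        x = x + self.attn(x, x, x)[0]
        return x + self.ff(x)

class TwoLevelFusionBackbone(nn.Module):
    """Assumed reading of the paper's finding: portrait features fused deeply
    (at every block), audio features fused shallowly (input concatenation)."""

    def __init__(self, d_model=256, audio_dim=128, depth=4):
        super().__init__()
        self.audio_in = nn.Linear(d_model + audio_dim, d_model)   # shallow "direct" fusion
        self.blocks = nn.ModuleList([FusionBlock(d_model) for _ in range(depth)])

    def forward(self, video_tokens, portrait_feat, audio_feat):
        # video_tokens: (B, T, d), portrait_feat: (B, T, d), audio_feat: (B, T, audio_dim)
        h = self.audio_in(torch.cat([video_tokens, audio_feat], dim=-1))
        for blk in self.blocks:
            h = blk(h, deep_cond=portrait_feat)
        return h

backbone = TwoLevelFusionBackbone()
out = backbone(torch.randn(1, 32, 256), torch.randn(1, 32, 256), torch.randn(1, 32, 128))
print(out.shape)    # torch.Size([1, 32, 256])
```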