⚠️ All of the summaries below are generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Please note: never rely on these summaries in serious academic settings; use them only as an initial screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-22
DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model
Authors:Massa Baali, Rita Singh, Bhiksha Raj
Self-supervised speech models have achieved remarkable success on content-driven tasks, yet they remain limited in capturing speaker-discriminative features critical for verification, diarization, and profiling applications. We introduce DELULU, a speaker-aware self-supervised foundational model that addresses this limitation by integrating external supervision into the pseudo-label generation process. DELULU leverages frame-level embeddings from ReDimNet, a state-of-the-art speaker verification model, to guide the k-means clustering step during pre-training, introducing a strong speaker-discriminative inductive bias that aligns representation learning with speaker identity. The model is trained using a dual objective that combines masked prediction and denoising, further enhancing robustness and generalization. DELULU significantly outperforms prior self-supervised learning (SSL) models across a range of speaker-centric tasks, achieving up to 62% relative improvement in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks such as gender, age, accent, and speaker counting. Our findings demonstrate that DELULU is a strong universal encoder for speaker-aware speech processing, enabling superior performance even without task-specific fine-tuning.
Paper and project links
Summary
This paper introduces DELULU, a speaker-aware self-supervised foundation model. It addresses the limitation of existing self-supervised speech models in capturing speaker-discriminative features by integrating external supervision into the pseudo-label generation process. DELULU uses frame-level embeddings from ReDimNet to guide the k-means clustering step during pre-training, introducing a strong speaker-discriminative inductive bias that aligns representation learning with speaker identity. The model is trained with a dual objective combining masked prediction and denoising, which improves robustness and generalization. DELULU significantly outperforms prior self-supervised learning models across a range of speaker-centric tasks, achieving up to a 62% relative improvement in equal error rate for speaker verification and consistent gains on zero-shot profiling tasks.
Key Takeaways
- DELULU is a speaker-aware self-supervised foundation model that addresses the limitation of existing speech models in capturing speaker-discriminative features.
- DELULU improves performance by integrating external supervision into the pseudo-label generation process.
- Frame-level embeddings from ReDimNet guide the k-means clustering step in DELULU's pre-training, introducing a strong speaker-discriminative inductive bias (see the sketch after this list).
- DELULU is trained with a dual objective that combines masked prediction and denoising, improving robustness and generalization.
- DELULU performs strongly across a range of speaker-centric tasks, including speaker verification and gender, age, accent, and speaker-counting profiling.
- DELULU achieves up to a 62% relative improvement in equal error rate for speaker verification, demonstrating its advantage.
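The key step the abstract describes is replacing the usual clustering over acoustic or model-internal features with clustering over frame-level embeddings from a speaker verification model, so that the resulting pseudo-labels carry a speaker-discriminative signal. The paper's code is not shown here, so the sketch below is only an illustration of that step: it assumes a speaker model (standing in for ReDimNet) that returns per-frame embeddings, and all names and the cluster count are placeholders.

```python
# Minimal sketch of speaker-guided pseudo-label generation, assuming a speaker
# model (stand-in for ReDimNet) that returns frame-level embeddings per utterance.
# Function names and the cluster count are illustrative, not the authors' code.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def make_pseudo_labels(frame_embeddings, n_clusters=500, seed=0):
    """frame_embeddings: list of (n_frames_i, embed_dim) arrays, one per utterance.
    Returns per-frame cluster IDs that a HuBERT-style masked-prediction loss
    would use as training targets."""
    stacked = np.concatenate(frame_embeddings, axis=0)
    kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed).fit(stacked)
    return [kmeans.predict(e) for e in frame_embeddings]

# Hypothetical usage:
# embeddings = [speaker_model.frame_embeddings(wav) for wav in corpus]
# targets = make_pseudo_labels(embeddings)
```

Because frames from the same speaker tend to fall into the same clusters, such targets bias the masked-prediction objective toward speaker identity rather than purely phonetic content.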
Schrödinger Bridge Mamba for One-Step Speech Enhancement
Authors:Jing Yang, Sirui Wang, Chao Wu, Fan Fan
We propose Schrödinger Bridge Mamba (SBM), a new concept of training-inference framework motivated by the inherent compatibility between Schrödinger Bridge (SB) training paradigm and selective state-space model Mamba. We exemplify the concept of SBM with an implementation for generative speech enhancement. Experiments on a joint denoising and dereverberation task using four benchmark datasets demonstrate that SBM, with only 1-step inference, outperforms strong baselines with 1-step or iterative inference and achieves the best real-time factor (RTF). Beyond speech enhancement, we discuss the integration of SB paradigm and selective state-space model architecture based on their underlying alignment, which indicates a promising direction for exploring new deep generative models potentially applicable to a broad range of generative tasks. Demo page: https://sbmse.github.io
Paper and project links
PDF 5 pages, 1 figure
Summary
This paper proposes Schrödinger Bridge Mamba (SBM), a new training-inference framework motivated by the inherent compatibility between the Schrödinger Bridge (SB) training paradigm and the selective state-space model Mamba. The concept is demonstrated with an implementation for generative speech enhancement. Experiments on a joint denoising and dereverberation task across four benchmark datasets show that SBM, with only one-step inference, outperforms strong baselines using one-step or iterative inference and achieves the best real-time factor (RTF). Beyond speech enhancement, the paper discusses integrating the SB paradigm with selective state-space architectures based on their underlying alignment, pointing to a promising direction for new deep generative models applicable to a broad range of generative tasks.
Key Takeaways
- Proposes Schrödinger Bridge Mamba (SBM), a new training-inference framework built on the inherent compatibility between the Schrödinger Bridge (SB) training paradigm and the selective state-space model Mamba.
- The SBM concept is demonstrated with an implementation for generative speech enhancement.
- On a joint denoising and dereverberation task, SBM with only one-step inference outperforms strong baselines that use one-step or iterative inference.
- SBM achieves the best real-time factor (RTF).
- Beyond speech enhancement, the paper discusses integrating the SB paradigm with selective state-space model architectures.
- The underlying alignment between the two points toward new deep generative models applicable to a broad range of generative tasks.
SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models
Authors:Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang
Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user’s turn and can lead to high response latency while it waits to think. Consequently, thinking after receiving the full input is not suitable for speech-to-speech interaction, where real-time, low-latency exchange is important. We address this by noting that humans naturally “think while listening.” In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input. SHANKS streams the input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning based on all previous speech and reasoning, while the user continues speaking. SHANKS uses this unspoken reasoning to decide whether to interrupt the user and to make tool calls to complete the task. We demonstrate that SHANKS enhances real-time user-SLM interaction in two scenarios: (1) when the user is presenting a step-by-step solution to a math problem, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking; and (2) in a tool-augmented dialogue, SHANKS can complete 56.9% of the tool calls before the user finishes their turn. Overall, SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends. Animated illustrations of Shanks can be found at https://d223302.github.io/SHANKS/
Paper and project links
PDF Work in progress
Summary:
Current large language models (LLMs) and spoken language models (SLMs) begin thinking and acting only after the user finishes their turn, which leads to high response latency and makes them poorly suited to real-time speech-to-speech interaction. This paper proposes SHANKS, a general inference framework that lets an SLM think while listening, generating unspoken chain-of-thought reasoning during the user's input. SHANKS streams the input speech in fixed-duration chunks and, as soon as each chunk arrives, generates unspoken reasoning conditioned on all previous speech and reasoning. Experiments show that SHANKS improves real-time user-SLM interaction in two scenarios: while a user works through a math problem step by step and in tool-augmented dialogue, it can reason and act during the user's turn, raising interruption accuracy and tool-use efficiency.
Key Takeaways:
- Existing language models only start thinking after the user finishes speaking, which causes high response latency.
- SHANKS lets an SLM think while listening, generating unspoken chain-of-thought reasoning during the user's input.
- SHANKS streams speech in fixed-duration chunks and, as each chunk arrives, generates unspoken reasoning from all previous speech and reasoning (see the sketch after this list).
- When a user presents a step-by-step math solution, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking.
- In tool-augmented dialogue, SHANKS completes 56.9% of tool calls before the user finishes their turn, improving tool-use efficiency.
- SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends.
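The abstract describes the control flow rather than an API, so the following is a minimal sketch of that chunked listen-and-reason loop under assumed interfaces: `audio_stream.chunks`, `slm.reason`, `slm.should_interrupt`, and `slm.speak` are hypothetical names, and the two-second chunk length is an assumption rather than a value from the paper.

```python
# Toy sketch of "thinking while listening": after every fixed-duration chunk the
# SLM produces unspoken reasoning and may interrupt or fire a tool call early.
# All object interfaces here are hypothetical stand-ins.
CHUNK_SECONDS = 2.0  # fixed chunk duration (assumed value)

def listen_and_think(audio_stream, slm):
    speech_so_far, reasoning_so_far = [], []
    for chunk in audio_stream.chunks(seconds=CHUNK_SECONDS):
        speech_so_far.append(chunk)
        # Unspoken chain-of-thought conditioned on all speech and prior reasoning;
        # it is generated while the user keeps talking and is never spoken aloud.
        thought = slm.reason(speech=speech_so_far, prior=reasoning_so_far)
        reasoning_so_far.append(thought)
        if thought.tool_call is not None:
            thought.tool_call.execute()          # start task work before the turn ends
        if slm.should_interrupt(reasoning_so_far):
            return slm.speak(reasoning_so_far)   # barge in, e.g. to flag a mistake
    return slm.speak(reasoning_so_far)           # otherwise reply after the turn
```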
Speech Foundation Models Generalize to Time Series Tasks from Wearable Sensor Data
Authors:Jaya Narain, Zakaria Aldeneh, Shirley Ren
Both speech and sensor time series data encode information in both the time- and frequency- domains, like spectral powers and waveform shapelets. We show that speech foundation models learn representations that generalize beyond the speech domain and achieve state-of-the-art performance on diverse time-series tasks from wearable sensors. Probes trained on features extracted from HuBERT and wav2vec 2.0 outperform those extracted from self-supervised models trained directly on modality-specific datasets for mood classification, arrhythmia detection, and activity classification tasks. We find that the convolutional feature encoders of speech models are particularly relevant for wearable sensor applications. The proposed approach enhances performance on data-scarce time-series tasks using simple probing methods. This work takes a step toward developing generalized time-series models that unify speech and sensor modalities.
Paper and project links
PDF Preprint, under review
Summary
Both speech and sensor time series encode information in the time and frequency domains, such as spectral powers and waveform shapelets. This work shows that representations learned by speech foundation models generalize beyond the speech domain and achieve state-of-the-art performance on diverse time-series tasks from wearable sensors, including mood classification, arrhythmia detection, and activity classification. Probes trained on features extracted from HuBERT and wav2vec 2.0 outperform probes on features from self-supervised models trained directly on modality-specific datasets, and the convolutional feature encoders of speech models prove particularly relevant for wearable-sensor applications. Using simple probing methods, the approach improves performance on data-scarce time-series tasks, taking a step toward generalized time-series models that unify speech and sensor modalities.
Key Takeaways
- Speech and sensor time series both encode information in the time and frequency domains, such as spectral powers and waveform shapelets.
- Speech foundation models learn representations that generalize beyond the speech domain.
- These representations achieve state-of-the-art performance on diverse wearable-sensor tasks, including mood classification, arrhythmia detection, and activity classification.
- The convolutional feature encoders of speech models are particularly relevant for wearable-sensor applications.
- Simple probing of these features improves performance on data-scarce time-series tasks (see the sketch after this list).
- This work is a step toward generalized time-series models that unify speech and sensor modalities.
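The probing recipe described above is easy to approximate with public checkpoints: freeze a speech model, extract features for each labeled sensor window, and fit a linear probe. The sketch below uses the HuggingFace HuBERT checkpoint for illustration; feeding a 1-D sensor window as if it were 16 kHz audio is an assumption made here for simplicity, the probe is an ordinary logistic regression, and the paper's preprocessing and probe design may differ (it also highlights the convolutional feature encoder, whereas this sketch pools the transformer output).

```python
# Sketch of a frozen-feature linear probe on sensor windows, using HuBERT as the
# feature extractor. Treating sensor windows as 16 kHz audio is an assumption.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoFeatureExtractor, AutoModel

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = AutoModel.from_pretrained("facebook/hubert-base-ls960").eval()

@torch.no_grad()
def embed(window: np.ndarray) -> np.ndarray:
    """Mean-pool HuBERT hidden states for a 1-D window treated as 16 kHz audio."""
    inputs = extractor(window, sampling_rate=16000, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state        # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

def fit_probe(windows, labels):
    """Train a simple linear probe on frozen speech-model features."""
    features = np.stack([embed(w) for w in windows])
    return LogisticRegression(max_iter=1000).fit(features, labels)
```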
Late Fusion and Multi-Level Fission Amplify Cross-Modal Transfer in Text-Speech LMs
Authors:Santiago Cuervo, Adel Moumen, Yanis Labrak, Sameer Khurana, Antoine Laurent, Mickael Rouvier, Phil Woodland, Ricard Marxer
Text-Speech Language Models (TSLMs) – language models trained to jointly process and generate text and speech – are commonly trained through an early modality fusion/fission approach, in which both modalities are fed and predicted from a shared backbone via linear layers. We hypothesize that this approach limits cross-modal transfer by neglecting feature compositionality – specifically, the finer-grained nature of speech representations compared to text – preventing the emergence of a shared feature hierarchy within model layers. In this paper, we argue that this limitation can be addressed through late fusion and fission, with a fission process that accesses both high- and low-level features for speech generation. Our models implementing these principles, SmolTolk, rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute, and achieve significantly improved cross-modal performance relative to early fusion/fission baselines. Representation analyses further suggest that our method enhances the model’s ability to abstract higher-level, more semantic features from speech, and leads to increasingly shared representation spaces across layers.
Paper and project links
Summary
Text-Speech Language Models (TSLMs) are usually trained with early modality fusion/fission, in which both modalities are fed to and predicted from a shared backbone via linear layers. The authors hypothesize that this neglects feature compositionality, in particular the finer-grained nature of speech representations compared to text, and so limits cross-modal transfer by preventing a shared feature hierarchy from emerging across model layers. They address this with late fusion and fission, where the fission process accesses both high- and low-level features for speech generation. Their models, SmolTolk, rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute and achieve significantly improved cross-modal performance relative to early fusion/fission baselines. Representation analyses further suggest the method enhances the model's ability to abstract higher-level, more semantic features from speech and leads to increasingly shared representation spaces across layers.
Key Takeaways
- TSLMs are commonly trained with early modality fusion/fission, processing and predicting both modalities from a shared backbone via linear layers.
- Early fusion/fission neglects feature compositionality and limits cross-modal transfer.
- The authors address this with late fusion and fission, using a fission process that accesses both high- and low-level features for speech generation (see the sketch after this list).
- SmolTolk rivals or surpasses state-of-the-art TSLMs trained with orders of magnitude more compute.
- SmolTolk achieves significantly improved cross-modal performance relative to early fusion/fission baselines.
- Representation analyses suggest the method enhances the model's ability to abstract higher-level, more semantic features from speech.
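To make the late-fission idea concrete, here is a toy sketch of an output stage in which the speech head reads both a lower backbone layer (finer-grained detail) and the top layer (more semantic content), while the text head reads only the top layer. The layer index, head shapes, and class name are assumptions for illustration, not SmolTolk's actual architecture.

```python
# Toy late-fission output stage: speech prediction taps low- and high-level
# features, text prediction taps only the top layer. Shapes and indices assumed.
import torch
import torch.nn as nn

class LateFissionHeads(nn.Module):
    def __init__(self, d_model: int, text_vocab: int, speech_vocab: int, low_layer: int = 4):
        super().__init__()
        self.low_layer = low_layer
        self.text_head = nn.Linear(d_model, text_vocab)
        self.speech_head = nn.Linear(2 * d_model, speech_vocab)

    def forward(self, hidden_states):
        # hidden_states: list of (batch, seq, d_model) tensors, one per backbone
        # layer, e.g. from a transformer run with output_hidden_states=True.
        low, high = hidden_states[self.low_layer], hidden_states[-1]
        text_logits = self.text_head(high)
        speech_logits = self.speech_head(torch.cat([low, high], dim=-1))
        return text_logits, speech_logits
```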
Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision
Authors:Che Liu, Yingji Zhang, Dong Zhang, Weijie Zhang, Chenggong Gong, Yu Lu, Shilin Zhou, Ziliang Gan, Ziao Wang, Haipang Wu, Ji Liu, André Freitas, Qifan Wang, Zenglin Xu, Rongjuncheng Zhang, Yong Dai
This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities to overcome challenges such as limited tri-modal datasets, high computational costs, and complex feature alignments. Our pipeline consists of three main components: First, a modular framework enabling flexible configuration of various encoder-LLM-decoder architectures. Second, a lightweight training strategy that pre-trains audio-language alignment on the state-of-the-art vision-language model Qwen2.5-VL, thus avoiding the costly pre-training of vision-specific modalities. Third, an audio synthesis pipeline that generates high-quality audio-text data from diverse real-world scenarios, supporting applications such as Automatic Speech Recognition and Speech-to-Speech chat. To this end, we introduce an industry-level omni-modal LLM, Nexus. Extensive experiments validate the efficacy of our pipeline, yielding the following key findings:(1) In the visual understanding task, Nexus exhibits superior performance compared with its backbone model - Qwen2.5-VL-7B, validating the efficiency of our training strategy. (2) Within the English Spoken Question-Answering task, the model achieves better accuracy than the same-period competitor (i.e, MiniCPM-o2.6-7B) in the LLaMA Q. benchmark. (3) In our real-world ASR testset, Nexus achieves outstanding performance, indicating its robustness in real scenarios. (4) In the Speech-to-Text Translation task, our model outperforms Qwen2-Audio-Instruct-7B. (5) In the Text-to-Speech task, based on pretrained vocoder (e.g., Fishspeech1.4 or CosyVoice2.0), Nexus is comparable to its backbone vocoder on Seed-TTS benchmark. (6) An in-depth analysis of tri-modal alignment reveals that incorporating the audio modality enhances representational alignment between vision and language.
Paper and project links
PDF Project: https://github.com/HiThink-Research/NEXUS-O
Summary
This paper proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities to overcome challenges such as limited tri-modal datasets, high computational costs, and complex feature alignment. The pipeline has three main components: a modular framework that allows flexible configuration of encoder-LLM-decoder architectures; a lightweight training strategy that pre-trains audio-language alignment on top of the state-of-the-art vision-language model Qwen2.5-VL, avoiding costly pre-training of vision-specific modalities; and an audio synthesis pipeline that generates high-quality audio-text data from diverse real-world scenarios, supporting applications such as automatic speech recognition and speech-to-speech chat. Experiments show that the resulting model, Nexus, performs strongly on visual understanding, English spoken question answering, real-world ASR, speech-to-text translation, and text-to-speech tasks. An in-depth analysis further shows that incorporating the audio modality enhances representational alignment between vision and language.
Key Takeaways
- Proposes an omni-modal LLM pipeline integrating auditory, visual, and linguistic modalities to address limited tri-modal data, high computational cost, and complex feature alignment.
- The pipeline consists of a modular encoder-LLM-decoder framework, a lightweight training strategy, and an audio synthesis pipeline (see the sketch after this list).
- Nexus performs strongly on visual understanding, English spoken question answering, ASR, speech-to-text translation, and text-to-speech tasks.
- Pre-training audio-language alignment on top of the vision-language model Qwen2.5-VL avoids costly vision-specific pre-training.
- In-depth analysis shows that incorporating the audio modality enhances representational alignment between vision and language.
- The pipeline supports applications such as automatic speech recognition and speech-to-speech chat.
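The abstract outlines a modular encoder-LLM-decoder layout in which audio is aligned to a pretrained vision-language backbone through a lightweight projection. The sketch below shows one plausible wiring of such a pipeline; every submodule, dimension, and the way modalities are concatenated are placeholders assumed for illustration, not Nexus's actual components.

```python
# Toy encoder-LLM-decoder wiring with learned projections into the LLM embedding
# space; all submodules are user-supplied stand-ins and the interface is assumed.
import torch
import torch.nn as nn

class OmniPipeline(nn.Module):
    def __init__(self, audio_encoder, vision_encoder, llm, speech_decoder,
                 audio_dim, vision_dim, llm_dim):
        super().__init__()
        self.audio_encoder = audio_encoder
        self.vision_encoder = vision_encoder
        self.audio_proj = nn.Linear(audio_dim, llm_dim)   # trained for audio-language alignment
        self.vision_proj = nn.Linear(vision_dim, llm_dim) # reused from the VL backbone
        self.llm = llm                                    # backbone operating on embeddings
        self.speech_decoder = speech_decoder              # e.g. drives a pretrained vocoder

    def forward(self, audio, image, text_embeds):
        a = self.audio_proj(self.audio_encoder(audio))    # (batch, a_len, llm_dim)
        v = self.vision_proj(self.vision_encoder(image))  # (batch, v_len, llm_dim)
        # Prepend non-text modalities to the text embeddings and run the LLM on
        # the combined sequence (interface assumed for this sketch).
        hidden = self.llm(torch.cat([a, v, text_embeds], dim=1))
        return hidden, self.speech_decoder(hidden)        # LLM states + optional speech output
```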
Efficient Long-duration Talking Video Synthesis with Linear Diffusion Transformer under Multimodal Guidance
Authors:Haojie Zhang, Zhihao Liang, Ruibo Fu, Bingyan Liu, Zhengqi Wen, Xuefei Liu, Jianhua Tao, Yaling Liang
Long-duration talking video synthesis faces enduring challenges in achieving high video quality, portrait and temporal consistency, and computational efficiency. As video length increases, issues such as visual degradation, identity inconsistency, temporal incoherence, and error accumulation become increasingly problematic, severely affecting the realism and reliability of the results. To address these challenges, we present LetsTalk, a diffusion transformer framework equipped with multimodal guidance and a novel memory bank mechanism, explicitly maintaining contextual continuity and enabling robust, high-quality, and efficient generation of long-duration talking videos. In particular, LetsTalk introduces a noise-regularized memory bank to alleviate error accumulation and sampling artifacts during extended video generation. To further improve efficiency and spatiotemporal consistency, LetsTalk employs a deep compression autoencoder and a spatiotemporal-aware transformer with linear attention for effective multimodal fusion. We systematically analyze three fusion schemes and show that combining deep (Symbiotic Fusion) for portrait features and shallow (Direct Fusion) for audio achieves superior visual realism and precise speech-driven motion, while preserving diversity of movements. Extensive experiments demonstrate that LetsTalk establishes new state-of-the-art in generation quality, producing temporally coherent and realistic talking videos with enhanced diversity and liveliness, and maintains remarkable efficiency with 8x fewer parameters than previous approaches.
Paper and project links
PDF 10 pages, 7 figures
Summary
Long-duration talking video synthesis struggles with video quality, portrait and temporal consistency, and computational efficiency: as videos grow longer, visual degradation, identity inconsistency, temporal incoherence, and error accumulation undermine realism and reliability. This paper presents LetsTalk, a diffusion transformer framework with multimodal guidance and a novel memory bank mechanism that explicitly maintains contextual continuity, enabling robust, high-quality, and efficient generation of long talking videos. A noise-regularized memory bank alleviates error accumulation and sampling artifacts during extended generation, while a deep compression autoencoder and a spatiotemporal-aware transformer with linear attention improve efficiency and spatiotemporal consistency through effective multimodal fusion. Of the three fusion schemes analyzed, deep (Symbiotic) fusion for portrait features combined with shallow (Direct) fusion for audio yields the best visual realism and precise speech-driven motion while preserving motion diversity. LetsTalk sets a new state of the art in generation quality, producing temporally coherent and realistic talking videos, and remains efficient with 8x fewer parameters than previous approaches.
Key Takeaways
- Long-duration talking video synthesis faces challenges in video quality, portrait and temporal consistency, and computational efficiency.
- LetsTalk addresses these challenges with a diffusion transformer framework equipped with multimodal guidance.
- A noise-regularized memory bank alleviates error accumulation and sampling artifacts during extended generation (see the sketch after this list).
- A deep compression autoencoder and a spatiotemporal-aware transformer with linear attention improve efficiency and spatiotemporal consistency.
- Combining deep (Symbiotic) fusion for portrait features with shallow (Direct) fusion for audio gives the best visual realism and speech-driven motion.
- LetsTalk sets a new state of the art in generation quality, producing more coherent and realistic talking videos with 8x fewer parameters than previous approaches.
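The abstract names a noise-regularized memory bank but does not spell out its mechanics, so the sketch below is only one possible reading: past frame latents are stored with a small Gaussian perturbation so that conditioning on them during long rollouts does not lock onto accumulated errors. The buffer size, noise scale, and class interface are assumptions for illustration.

```python
# One possible reading of a noise-regularized memory bank: store perturbed copies
# of recent frame latents and expose them as conditioning context. Values assumed.
import torch

class NoisyMemoryBank:
    def __init__(self, max_frames: int = 16, noise_std: float = 0.05):
        self.max_frames = max_frames
        self.noise_std = noise_std
        self.frames = []

    def push(self, latent: torch.Tensor) -> None:
        # Keep a lightly noised copy of the newly generated frame latent so the
        # generator does not overfit to its own accumulated sampling errors.
        self.frames.append(latent + self.noise_std * torch.randn_like(latent))
        self.frames = self.frames[-self.max_frames:]

    def context(self):
        # Stack stored latents along a time axis for conditioning the next frame.
        return torch.stack(self.frames, dim=1) if self.frames else None
```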