⚠️ All summaries below are generated by a large language model; they may contain errors, are for reference only, and should be used with caution
🔴 Please note: never use them in serious academic settings; they are intended only for an initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-19
FoleyBench: A Benchmark For Video-to-Audio Models
Authors:Satvik Dixit, Koichi Saito, Zhi Zhong, Yuki Mitsufuji, Chris Donahue
Video-to-audio generation (V2A) is of increasing importance in domains such as film post-production, AR/VR, and sound design, particularly for the creation of Foley sound effects synchronized with on-screen actions. Foley requires generating audio that is both semantically aligned with visible events and temporally aligned with their timing. Yet, there is a mismatch between evaluation and downstream applications due to the absence of a benchmark tailored to Foley-style scenarios. We find that 74% of videos from past evaluation datasets have poor audio-visual correspondence. Moreover, they are dominated by speech and music, domains that lie outside the use case for Foley. To address this gap, we introduce FoleyBench, the first large-scale benchmark explicitly designed for Foley-style V2A evaluation. FoleyBench contains 5,000 (video, ground-truth audio, text caption) triplets, each featuring visible sound sources with audio causally tied to on-screen events. The dataset is built using an automated, scalable pipeline applied to in-the-wild internet videos from YouTube-based and Vimeo-based sources. Compared to past datasets, we show that videos from FoleyBench have stronger coverage of sound categories from a taxonomy specifically designed for Foley sound. Each clip is further labeled with metadata capturing source complexity, UCS/AudioSet category, and video length, enabling fine-grained analysis of model performance and failure modes. We benchmark several state-of-the-art V2A models, evaluating them on audio quality, audio-video alignment, temporal synchronization, and audio-text consistency. Samples are available at: https://gclef-cmu.org/foleybench
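To make the metadata-driven evaluation concrete, here is a minimal Python sketch of how one might aggregate a per-clip metric over FoleyBench-style (video, ground-truth audio, caption) triplets by UCS category and source complexity. The field names, label values, and the dummy metric are assumptions for illustration, not the benchmark's actual API.

```python
from dataclasses import dataclass
from collections import defaultdict
from statistics import mean

@dataclass
class FoleyClip:
    video_path: str
    audio_path: str
    caption: str
    ucs_category: str       # e.g. "GLASS" -- hypothetical label values
    source_complexity: str  # e.g. "single" / "multi"
    duration_s: float

def evaluate(clips, score_fn):
    """Aggregate a per-clip metric along the metadata axes FoleyBench exposes."""
    by_category, by_complexity = defaultdict(list), defaultdict(list)
    for clip in clips:
        s = score_fn(clip)  # e.g. an AV-sync or audio-text consistency score
        by_category[clip.ucs_category].append(s)
        by_complexity[clip.source_complexity].append(s)
    return ({k: mean(v) for k, v in by_category.items()},
            {k: mean(v) for k, v in by_complexity.items()})

# Toy usage with a dummy metric standing in for a real scorer
clips = [FoleyClip("v0.mp4", "a0.wav", "glass shatters", "GLASS", "single", 6.0),
         FoleyClip("v1.mp4", "a1.wav", "footsteps on gravel", "FOOTSTEPS", "multi", 9.5)]
print(evaluate(clips, lambda c: len(c.caption) % 5 / 5.0))
```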
Paper and project links
Summary
Video-to-audio generation (V2A) is increasingly important in film post-production, AR/VR, and sound design, especially for creating Foley sound effects synchronized with on-screen actions. Because no benchmark is tailored to Foley-style scenarios, evaluation is mismatched with downstream applications: 74% of videos in past evaluation datasets show poor audio-visual correspondence, and those datasets are dominated by speech and music, which lie outside the Foley use case. FoleyBench is introduced as the first large-scale benchmark designed specifically for Foley-style V2A evaluation, with 5,000 (video, ground-truth audio, text caption) triplets in which visible sound sources are causally tied to on-screen events. The dataset is built with an automated, scalable pipeline applied to in-the-wild internet videos from YouTube- and Vimeo-based sources, covers a Foley-specific sound taxonomy more broadly than past datasets, and labels each clip with metadata on source complexity, UCS/AudioSet category, and video length for fine-grained analysis of model performance and failure modes. Several state-of-the-art V2A models are benchmarked on audio quality, audio-video alignment, temporal synchronization, and audio-text consistency. Samples are available at https://gclef-cmu.org/foleybench.
Key Takeaways
- Video-to-audio generation (V2A) is growing in importance across several domains, especially for creating sound effects synchronized with on-screen actions.
- Existing evaluation datasets suffer from poor audio-visual correspondence, and most of their videos lie outside the Foley domain (e.g., speech and music).
- FoleyBench is the first large-scale benchmark designed specifically for Foley-style V2A evaluation.
- FoleyBench contains a large set of (video, ground-truth audio, text caption) triplets emphasizing audio-visual correspondence and text consistency.
- The dataset is built with an automated, scalable pipeline applied to in-the-wild internet videos.
- FoleyBench videos broadly cover a Foley-specific sound taxonomy and carry metadata labels that enable fine-grained analysis.
Click here to view paper screenshots
Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis
Authors:Zaara Zabeen Arpa, Sadnam Sakib Apurbo, Nazia Karim Khan Oishee, Ajwad Abrar
Automatic Speech Recognition (ASR) transcripts, especially in low-resource languages like Bangla, contain a critical ambiguity: word-word repetitions can be either Repetition Disfluency (unintentional ASR error/hesitation) or Morphological Reduplication (a deliberate grammatical construct). Standard disfluency correction fails by erroneously deleting valid linguistic information. To solve this, we introduce the first publicly available, 20,000-row Bangla corpus, manually annotated to explicitly distinguish between these two phenomena in noisy ASR transcripts. We benchmark this novel resource using two paradigms: state-of-the-art multilingual Large Language Models (LLMs) and task-specific fine-tuning of encoder models. LLMs achieve competitive performance (up to 82.68% accuracy) with few-shot prompting. However, fine-tuning proves superior, with the language-specific BanglaBERT model achieving the highest accuracy of 84.78% and an F1 score of 0.677. This establishes a strong, linguistically-informed baseline and provides essential data for developing sophisticated, semantic-preserving text normalization systems for Bangla.
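As a concrete picture of the benchmarking setup, the sketch below computes accuracy and F1 for the two-way repetition label (disfluency vs. reduplication) from model predictions. The label names and toy examples are illustrative assumptions, not taken from the released corpus.

```python
def binary_metrics(y_true, y_pred, positive="DISFLUENCY"):
    """Accuracy and F1 for the two-way repetition label, as used for benchmarking."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, f1

# Toy example: each item is an ASR utterance containing a repeated word, labelled
# DISFLUENCY (unintentional error/hesitation) or REDUP (grammatical reduplication).
y_true = ["DISFLUENCY", "REDUP", "REDUP", "DISFLUENCY"]
y_pred = ["DISFLUENCY", "REDUP", "DISFLUENCY", "DISFLUENCY"]
print(binary_metrics(y_true, y_pred))
```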
Paper and project links
Summary
This paper addresses the ambiguity of word repetitions in Bangla automatic speech recognition (ASR) transcripts: a repetition can be either a repetition disfluency (an unintentional ASR error or hesitation) or a deliberate morphological reduplication (a grammatical construct). To tackle this, the authors release the first publicly available, manually annotated Bangla corpus of 20,000 rows that explicitly distinguishes the two phenomena. Benchmarking with state-of-the-art multilingual large language models and task-specific fine-tuning of encoder models shows that fine-tuning is superior: the language-specific BanglaBERT model achieves the best performance, with 84.78% accuracy and an F1 score of 0.677. The corpus provides essential data for developing sophisticated, semantics-preserving text normalization systems for Bangla.
Key Takeaways
- ASR transcripts in low-resource languages such as Bangla contain a key ambiguity: word-word repetitions can be either repetition disfluency (ASR error/hesitation) or morphological reduplication (a deliberate grammatical construct).
- Standard disfluency correction erroneously deletes valid linguistic information.
- The authors introduce the first publicly available Bangla corpus, 20,000 manually annotated rows, distinguishing the two phenomena in noisy ASR transcripts.
- The corpus is benchmarked with state-of-the-art multilingual large language models and with task-specific fine-tuning of encoder models.
- Multilingual LLMs perform competitively with few-shot prompting, but task-specific fine-tuning performs better; BanglaBERT achieves the highest accuracy and F1 score.
- The resource provides essential data for building sophisticated, semantics-preserving Bangla text normalization systems.
Click here to view paper screenshots
Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data
Authors:Sina Rashidi, Hossein Sameti
Direct speech-to-speech translation (S2ST), in which all components are trained jointly, is an attractive alternative to cascaded systems because it offers a simpler pipeline and lower inference latency. However, direct S2ST models require large amounts of parallel speech data in the source and target languages, which are rarely available for low-resource languages such as Persian. This paper presents a direct S2ST system for translating Persian speech into English speech, as well as a pipeline for synthetic parallel Persian-English speech generation. The model comprises three components: (1) a conformer-based encoder, initialized from self-supervised pre-training, maps source speech to high-level acoustic representations; (2) a causal transformer decoder with relative position multi-head attention translates these representations into discrete target speech units; (3) a unit-based neural vocoder generates waveforms from the predicted discrete units. To mitigate the data scarcity problem, we construct a new Persian-English parallel speech corpus by translating Persian speech transcriptions into English using a large language model and then synthesizing the corresponding English speech with a state-of-the-art zero-shot text-to-speech system. The resulting corpus increases the amount of available parallel speech by roughly a factor of six. On the Persian-English portion of the CVSS corpus, the proposed model achieves improvement of 4.6 ASR BLEU with the synthetic data over direct baselines. These results indicate that combining self-supervised pre-training, discrete speech units, and synthetic parallel data is effective for improving direct S2ST in low-resource language pairs such as Persian-English
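The synthetic-corpus pipeline can be pictured as a small loop over (Persian speech, transcript) pairs: translate the transcript with an LLM, then synthesize English speech with a zero-shot TTS system. The helper functions below are hypothetical placeholders for those two models, shown only to clarify the data flow, not the paper's actual implementation.

```python
def translate_transcript(persian_text: str) -> str:
    # Placeholder: in the real pipeline a large language model performs translation.
    return f"<EN translation of: {persian_text}>"

def synthesize_english_speech(english_text: str) -> bytes:
    # Placeholder: in the real pipeline a zero-shot TTS model returns a waveform.
    return english_text.encode("utf-8")

def build_synthetic_pairs(persian_corpus):
    """(Persian speech, transcript) -> (Persian speech, synthetic English speech, English text)."""
    pairs = []
    for persian_audio, transcript in persian_corpus:
        english_text = translate_transcript(transcript)
        english_audio = synthesize_english_speech(english_text)
        pairs.append((persian_audio, english_audio, english_text))
    return pairs

corpus = [(b"fa_wav_0", "persian transcript 0")]
print(build_synthetic_pairs(corpus)[0][2])
```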
Paper and project links
Summary
This work presents a direct speech-to-speech translation (S2ST) system from Persian to English, together with a pipeline for generating synthetic parallel Persian-English speech to mitigate data scarcity. By combining self-supervised pre-training, discrete speech units, and synthetic parallel data, the approach improves direct S2ST for the low-resource Persian-English pair, gaining 4.6 ASR BLEU over direct baselines on the Persian-English portion of the CVSS corpus.
Key Takeaways
- Direct speech-to-speech translation (S2ST) offers a simpler alternative to cascaded systems with lower inference latency.
- Direct S2ST models require large amounts of parallel speech data, which is hard to obtain for low-resource languages such as Persian.
- This work proposes a direct Persian-to-English S2ST system.
- The system comprises a conformer-based encoder, a causal transformer decoder, and a unit-based neural vocoder.
- A large language model translates Persian speech transcriptions into English, and a state-of-the-art zero-shot text-to-speech system synthesizes the corresponding English speech, yielding a new Persian-English parallel speech corpus.
- The synthetic corpus increases the available parallel speech by roughly a factor of six.
Click here to view paper screenshots
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
Authors:Yunxin Li, Xinyu Chen, Shenyuan Jiang, Haoyuan Shi, Zhenyu Liu, Xuanyu Zhang, Nanhao Deng, Zhenran Xu, Yicheng Ma, Meishan Zhang, Baotian Hu, Min Zhang
We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee’s Uni-MoE series in language-centric multimodal understanding, reasoning, and generating. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. of 8), omnimodallity understanding (+7% avg. of 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.
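A minimal PyTorch sketch of the shared/routed/null expert idea: every token passes through an always-on shared expert, a router picks top-k among the routed experts plus a null expert whose output is zero (so the router can effectively skip extra compute for a token). Dimensions, top-k, and the dense expert evaluation are illustrative simplifications under stated assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class SharedRoutedNullMoE(nn.Module):
    """Toy MoE layer: one shared expert, top-k routed experts, and a null expert
    that contributes zeros when selected by the router."""
    def __init__(self, dim=64, num_experts=4, k=2):
        super().__init__()
        self.shared = nn.Linear(dim, dim)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts + 1)   # last logit = null expert
        self.k = k

    def forward(self, x):                               # x: (tokens, dim)
        probs = self.router(x).softmax(dim=-1)          # (tokens, E + 1)
        topv, topi = probs.topk(self.k, dim=-1)
        gate = torch.zeros_like(probs).scatter(-1, topi, topv)  # keep only top-k weights
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (tokens, E, dim)
        null_out = torch.zeros_like(x).unsqueeze(1)                    # (tokens, 1, dim)
        routed = (gate.unsqueeze(-1) * torch.cat([expert_out, null_out], dim=1)).sum(dim=1)
        return self.shared(x) + routed                  # shared expert is always active

layer = SharedRoutedNullMoE()
print(layer(torch.randn(5, 64)).shape)                  # torch.Size([5, 64])
```

In a production MoE only the selected experts would be evaluated per token; the dense evaluation here just keeps the toy example short and readable.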
Paper and project links
PDF 47 pages,10 Figures, Project Website: https://idealistxy.github.io/Uni-MoE-v2.github.io/; Codes: https://github.com/HITsz-TMG/Uni-MoE
Summary
Uni-MoE 2.0 is a fully open-source omnimodal large model built from scratch on the Qwen2.5-7B dense architecture. It combines a dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and carefully curated multimodal data matching, substantially advancing language-centric multimodal understanding, reasoning, and generation. The model supports omnimodal understanding and can generate images, text, and speech. Its MoE framework balances computational efficiency and capability across ten cross-modal input types using shared, routed, and null experts, while Omni-Modality 3D RoPE keeps modalities spatio-temporally aligned in self-attention. After cross-modal pretraining, progressive supervised fine-tuning activates modality-specific experts, and balanced data composition plus an iterative GSPO-DPO method stabilize RL training and improve reasoning. Trained on roughly 75B tokens of open-source multimodal data with special speech and image generation tokens, the model achieves SOTA or highly competitive results across 85 benchmarks, with notable strengths in video understanding, omnimodal understanding, audiovisual reasoning, long-form speech processing, and low-level image processing and controllable generation.
Key Takeaways
- Uni-MoE 2.0 is a fully open-source omnimodal large model based on the Qwen2.5-7B architecture.
- Performance is improved through a dynamic-capacity Mixture-of-Experts design and a progressive training strategy.
- The model supports cross-modal understanding as well as image, text, and speech generation.
- The MoE framework handles ten types of cross-modal inputs while preserving computational efficiency.
- The base model is trained on roughly 75B tokens of open-source multimodal data.
- Evaluations show that Uni-MoE 2.0 performs strongly across many benchmarks, leading in particular in video understanding, omnimodal understanding, and audiovisual reasoning.
Click here to view paper screenshots
VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing
Authors:Zhisheng Zheng, Puyuan Peng, Anuj Diwan, Cong Phuoc Huynh, Xiaohang Sun, Zhu Liu, Vimal Bhat, David Harwath
We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft-x/.
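To illustrate the single-sequence framing, here is a toy reordering of time-aligned text and speech tokens in which the speech tokens of the segment to be edited are deferred to the end, so the model can generate them conditioned on everything else. This is a hedged simplification of the general idea; the actual reordering mechanism and special tokens used by VoiceCraft-X may differ, and the token strings below are purely illustrative.

```python
def build_sequence(segments, edit_index=None):
    """segments: list of (text_tokens, speech_tokens) pairs, time-aligned.
    Returns one flat sequence: all text first, then speech, with the edited
    segment's speech tokens deferred to the end so they are generated last."""
    text = [t for txt, _ in segments for t in txt]
    kept, deferred = [], []
    for i, (_, speech) in enumerate(segments):
        (deferred if i == edit_index else kept).extend(speech)
    return text + ["<speech>"] + kept + ["<fill>"] + deferred

segments = [(["hel", "lo"], ["s1", "s2", "s3"]), (["world"], ["s4", "s5"])]
print(build_sequence(segments, edit_index=0))
# ['hel', 'lo', 'world', '<speech>', 's4', 's5', '<fill>', 's1', 's2', 's3']
```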
Paper and project links
PDF EMNLP 2025. Demo and code are available at https://zhishengzheng.com/voicecraft-x/
Summary
VoiceCraft-X is an autoregressive neural codec language model that unifies multilingual speech editing and zero-shot text-to-speech (TTS) synthesis across 11 languages. It uses the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens, handling both tasks as a single sequence generation problem. VoiceCraft-X generates high-quality, natural-sounding speech and can seamlessly create new audio or edit existing recordings within one framework.
Key Takeaways
- VoiceCraft-X is an autoregressive neural codec language model unifying multilingual speech editing and zero-shot TTS synthesis.
- It supports 11 languages, including English, Mandarin, Korean, and Japanese.
- The Qwen3 large language model is used for phoneme-free cross-lingual text processing.
- A novel token reordering mechanism casts both tasks as a single sequence generation problem.
- It generates high-quality, natural-sounding speech for creating new audio or editing existing recordings.
- It performs robustly across diverse linguistic settings, even with limited data per language.
Click here to view paper screenshots
Multi-Metric Preference Alignment for Generative Speech Restoration
Authors:Junan Zhang, Xueyao Zhang, Jing Yang, Yuancheng Wang, Fan Fan, Zhizheng Wu
Recent generative models have significantly advanced speech restoration tasks, yet their training objectives often misalign with human perceptual preferences, resulting in suboptimal quality. While post-training alignment has proven effective in other generative domains like text and image generation, its application to generative speech restoration remains largely under-explored. This work investigates the challenges of applying preference-based post-training to this task, focusing on how to define a robust preference signal and curate high-quality data to avoid reward hacking. To address these challenges, we propose a multi-metric preference alignment strategy. We construct a new dataset, GenSR-Pref, comprising 80K preference pairs, where each chosen sample is unanimously favored by a complementary suite of metrics covering perceptual quality, signal fidelity, content consistency, and timbre preservation. This principled approach ensures a holistic preference signal. Applying Direct Preference Optimization (DPO) with our dataset, we observe consistent and significant performance gains across three diverse generative paradigms: autoregressive models (AR), masked generative models (MGM), and flow-matching models (FM) on various restoration benchmarks, in both objective and subjective evaluations. Ablation studies confirm the superiority of our multi-metric strategy over single-metric approaches in mitigating reward hacking. Furthermore, we demonstrate that our aligned models can serve as powerful ‘’data annotators’’, generating high-quality pseudo-labels to serve as a supervision signal for traditional discriminative models in data-scarce scenarios like singing voice restoration. Demo Page:https://gensr-pref.github.io
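The dataset-curation rule can be sketched in a few lines: a (chosen, rejected) pair is kept only when every metric in the suite prefers the same candidate, so no single hackable metric decides the preference on its own. The metric names and score format below are assumptions for illustration, not the paper's exact metric suite.

```python
def unanimous_pairs(candidates, metrics):
    """Keep (chosen, rejected) pairs only when every metric prefers the same candidate.
    candidates: list of dicts with per-metric scores for two restorations A and B."""
    pairs = []
    for item in candidates:
        votes = [1 if item[f"A_{m}"] > item[f"B_{m}"] else -1 for m in metrics]
        if all(v == 1 for v in votes):
            pairs.append((item["A_audio"], item["B_audio"]))
        elif all(v == -1 for v in votes):
            pairs.append((item["B_audio"], item["A_audio"]))
        # disagreement between metrics -> discard, to avoid rewarding a single hackable metric
    return pairs

metrics = ["perceptual", "fidelity", "content", "timbre"]   # illustrative metric names
item = {"A_audio": "a.wav", "B_audio": "b.wav",
        "A_perceptual": 4.1, "B_perceptual": 3.2, "A_fidelity": 0.9, "B_fidelity": 0.7,
        "A_content": 0.95, "B_content": 0.90, "A_timbre": 0.8, "B_timbre": 0.6}
print(unanimous_pairs([item], metrics))   # [('a.wav', 'b.wav')]
```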
Paper and project links
PDF Accepted by AAAI 2026. Demopage: https://gensr-pref.github.io
Summary
Recent generative models have substantially advanced speech restoration, but their training objectives often misalign with human perceptual preferences, yielding suboptimal quality. This work studies the challenges of applying preference-based post-training to generative speech restoration, focusing on how to define a robust preference signal and curate high-quality data to avoid reward hacking. The authors propose a multi-metric preference alignment strategy and construct GenSR-Pref, a dataset of 80K preference pairs in which each chosen sample is unanimously favored by a complementary suite of metrics covering perceptual quality, signal fidelity, content consistency, and timbre preservation. Applying Direct Preference Optimization (DPO) with this dataset yields consistent, significant gains across three generative paradigms (autoregressive, masked generative, and flow-matching models) in both objective and subjective evaluations. Ablation studies confirm that the multi-metric strategy mitigates reward hacking better than single-metric approaches, and the aligned models can act as powerful "data annotators", generating high-quality pseudo-labels to supervise traditional discriminative models in data-scarce scenarios such as singing voice restoration.
Key Takeaways
- Recent generative models have made significant progress in speech restoration, but their training objectives are misaligned with human perceptual preferences.
- Preference-based post-training alignment remains largely under-explored for generative speech restoration.
- A multi-metric preference alignment strategy is proposed, with a new dataset, GenSR-Pref, of 80K preference pairs that defines a robust preference signal and avoids reward hacking.
- Applying Direct Preference Optimization (DPO) yields significant gains across three different generative paradigms.
- The multi-metric strategy outperforms single-metric approaches at mitigating reward hacking.
- Aligned models can serve as powerful "data annotators", generating high-quality pseudo-labels to supervise traditional discriminative models.
- The approach has potential value in data-scarce scenarios such as singing voice restoration.
Click here to view paper screenshots
DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models
Authors:Yuanyuan Wang, Dongchao Yang, Yiwen Shao, Hangting Chen, Jiankun Zhao, Zhiyong Wu, Helen Meng, Xixin Wu
Extending pre-trained text Large Language Models (LLMs)’s speech understanding or generation abilities by introducing various effective speech tokens has attracted great attention in the speech community. However, building a unified speech understanding and generation model still faces the following challenges: (1) Due to the huge modality gap between speech and text tokens, extending text LLMs to unified speech LLMs relies on large-scale paired data for fine-tuning, and (2) Generation and understanding tasks prefer information at different levels, e.g., generation benefits from detailed acoustic features, while understanding favors high-level semantics. This divergence leads to difficult performance optimization in one unified model. To solve these challenges, in this paper, we present two key insights in speech tokenization and speech language modeling. Specifically, we first propose an Understanding-driven Speech Tokenizer (USTokenizer), which extracts high-level semantic information essential for accomplishing understanding tasks using text LLMs. In this way, USToken enjoys better modality commonality with text, which reduces the difficulty of modality alignment in adapting text LLMs to speech LLMs. Secondly, we present DualSpeechLM, a dual-token modeling framework that concurrently models USToken as input and acoustic token as output within a unified, end-to-end framework, seamlessly integrating speech understanding and generation capabilities. Furthermore, we propose a novel semantic supervision loss and a Chain-of-Condition (CoC) strategy to stabilize model training and enhance speech generation performance. Experimental results demonstrate that our proposed approach effectively fosters a complementary relationship between understanding and generation tasks, highlighting the promising strategy of mutually enhancing both tasks in one unified model.
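As a rough picture of the training objective (assuming a standard cross-entropy formulation), the sketch below combines a next acoustic-token loss with an auxiliary semantic supervision term on predicted USTokens. The weighting, shapes, and vocabulary sizes are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def dual_token_loss(acoustic_logits, acoustic_targets, semantic_logits, semantic_targets,
                    lambda_sem=0.5):
    """Toy combined objective: acoustic-token cross-entropy plus an auxiliary
    semantic supervision term on USToken predictions. lambda_sem is illustrative."""
    gen_loss = F.cross_entropy(acoustic_logits.flatten(0, 1), acoustic_targets.flatten())
    sem_loss = F.cross_entropy(semantic_logits.flatten(0, 1), semantic_targets.flatten())
    return gen_loss + lambda_sem * sem_loss

B, T, Va, Vs = 2, 10, 1024, 512   # batch, length, acoustic vocab, semantic vocab (toy sizes)
loss = dual_token_loss(torch.randn(B, T, Va), torch.randint(0, Va, (B, T)),
                       torch.randn(B, T, Vs), torch.randint(0, Vs, (B, T)))
print(loss.item())
```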
Paper and project links
PDF Accepted by AAAI 2026
Summary
Extending pre-trained text large language models (LLMs) with speech understanding or generation abilities via effective speech tokens has attracted wide attention. Building a unified speech understanding and generation model, however, faces two challenges: the large modality gap between speech and text tokens makes adaptation depend on large-scale paired data for fine-tuning, and generation and understanding prefer information at different levels (detailed acoustic features versus high-level semantics), which makes joint optimization in one model difficult. To address these challenges, the paper proposes an Understanding-driven Speech Tokenizer (USTokenizer) that extracts the high-level semantic information essential for understanding tasks with text LLMs; USToken shares more modality commonality with text, easing modality alignment when adapting text LLMs to speech. It further presents DualSpeechLM, a dual-token framework that models USToken as input and acoustic tokens as output within one unified, end-to-end model, seamlessly integrating speech understanding and generation, together with a novel semantic supervision loss and a Chain-of-Condition (CoC) strategy that stabilize training and improve speech generation. Experiments show that the approach fosters a complementary relationship between understanding and generation within a single unified model.
Key Takeaways
- A large modality gap separates speech and text tokens; extending a text LLM to a unified speech LLM requires large-scale paired data for fine-tuning.
- Generation and understanding tasks prefer information at different levels, making it hard to optimize a single unified model.
- The Understanding-driven Speech Tokenizer (USTokenizer) extracts high-level semantic information to support speech understanding and generation.
- USToken has better modality commonality with text, reducing the difficulty of modality alignment.
- DualSpeechLM is a dual-token framework that unifies speech understanding and generation.
- A novel semantic supervision loss stabilizes training and improves speech generation.
- Experiments show the approach effectively fosters a complementary relationship between understanding and generation.
Click here to view paper screenshots
SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation
Authors:Lai Jiang, Yuekang Li, Xiaohan Zhang, Youtao Ding, Li Pan
Accurate jailbreak evaluation is critical for LLM red team testing and jailbreak research. Mainstream methods rely on binary classification (string matching, toxic text classifiers, and LLM-based methods), outputting only “yes/no” labels without quantifying harm severity. Emerged multi-dimensional frameworks (e.g., Security Violation, Relative Truthfulness and Informativeness) use unified evaluation standards across scenarios, leading to scenario-specific mismatches (e.g., “Relative Truthfulness” is irrelevant to “hate speech”), undermining evaluation accuracy. To address these, we propose SceneJailEval, with key contributions: (1) A pioneering scenario-adaptive multi-dimensional framework for jailbreak evaluation, overcoming the critical “one-size-fits-all” limitation of existing multi-dimensional methods, and boasting robust extensibility to seamlessly adapt to customized or emerging scenarios. (2) A novel 14-scenario dataset featuring rich jailbreak variants and regional cases, addressing the long-standing gap in high-quality, comprehensive benchmarks for scenario-adaptive evaluation. (3) SceneJailEval delivers state-of-the-art performance with an F1 score of 0.917 on our full-scenario dataset (+6% over SOTA) and 0.995 on JBB (+3% over SOTA), breaking through the accuracy bottleneck of existing evaluation methods in heterogeneous scenarios and solidifying its superiority.
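The scenario-adaptive idea can be pictured as a per-scenario registry of evaluation dimensions, so that only the dimensions relevant to a scenario enter its score. The scenario names, dimension names, and simple averaging below are illustrative assumptions, not the framework's actual taxonomy or aggregation rule.

```python
# Illustrative scenario -> evaluation-dimension mapping (names are assumptions).
SCENARIO_DIMS = {
    "hate_speech":    ["security_violation", "informativeness"],
    "misinformation": ["security_violation", "relative_truthfulness", "informativeness"],
}

def jailbreak_score(scenario, dim_scores):
    """Average only the dimensions registered for this scenario, so irrelevant
    dimensions (e.g. truthfulness for hate speech) never dilute the verdict."""
    dims = SCENARIO_DIMS[scenario]
    return sum(dim_scores[d] for d in dims) / len(dims)

print(jailbreak_score("hate_speech",
                      {"security_violation": 0.9, "informativeness": 0.7,
                       "relative_truthfulness": 0.1}))   # truthfulness is ignored here
```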
Paper and project links
PDF This paper has been accepted by AAAI 2026 as a poster
Summary
Accurate jailbreak evaluation is critical for LLM red-team testing and jailbreak research. Mainstream methods rely on binary classification and output only "yes/no" labels without quantifying harm severity, while emerging multi-dimensional frameworks apply one unified standard across scenarios, causing scenario-specific mismatches (e.g., "Relative Truthfulness" is irrelevant to hate speech) that undermine evaluation accuracy. SceneJailEval addresses this with a scenario-adaptive multi-dimensional evaluation framework that overcomes the "one-size-fits-all" limitation and extends easily to customized or emerging scenarios, together with a new 14-scenario dataset featuring rich jailbreak variants and regional cases. SceneJailEval reaches an F1 score of 0.917 on the full-scenario dataset (+6% over SOTA) and 0.995 on JBB (+3% over SOTA), breaking through the accuracy bottleneck of existing evaluation methods in heterogeneous scenarios.
Key Takeaways
- Accurate jailbreak evaluation is critical for LLM red-team testing and jailbreak research.
- Existing evaluation methods rely mostly on binary classification and do not quantify harm severity.
- Multi-dimensional frameworks use unified standards across scenarios, leading to scenario-specific mismatches.
- SceneJailEval proposes a scenario-adaptive multi-dimensional framework that overcomes the limitations of existing methods.
- SceneJailEval includes a 14-scenario dataset covering rich jailbreak variants and regional cases.
- SceneJailEval performs strongly on the full-scenario dataset and on JBB, with clear gains over SOTA.
Click here to view paper screenshots
READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation
Authors:Haotian Wang, Yuzhe Weng, Jun Du, Haoran Xu, Xiaoyan Wu, Shan He, Bing Yin, Cong Liu, Jianqing Gao, Qingfeng Liu
The introduction of diffusion models has brought significant advances to the field of audio-driven talking head generation. However, the extremely slow inference speed severely limits the practical implementation of diffusion-based talking head generation models. In this study, we propose READ, a real-time diffusion-transformer-based talking head generation framework. Our approach first learns a spatiotemporal highly compressed video latent space via a temporal VAE, significantly reducing the token count to accelerate generation. To achieve better audio-visual alignment within this compressed latent space, a pre-trained Speech Autoencoder (SpeechAE) is proposed to generate temporally compressed speech latent codes corresponding to the video latent space. These latent representations are then modeled by a carefully designed Audio-to-Video Diffusion Transformer (A2V-DiT) backbone for efficient talking head synthesis. Furthermore, to ensure temporal consistency and accelerated inference in extended generation, we propose a novel asynchronous noise scheduler (ANS) for both the training and inference processes of our framework. The ANS leverages asynchronous add-noise and asynchronous motion-guided generation in the latent space, ensuring consistency in generated video clips. Experimental results demonstrate that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime, achieving an optimal balance between quality and speed while maintaining robust metric stability in long-time generation.
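To make the asynchronous idea concrete, the sketch below builds a per-frame denoising schedule in which each latent frame starts a fixed number of iterations after the previous one, so earlier frames are always cleaner and can guide later ones. This is a minimal illustration of asynchronous scheduling in general, not the exact ANS used by READ; the `lag` constant and array semantics are assumptions.

```python
import numpy as np

def asynchronous_schedule(num_frames, num_steps, lag=2):
    """Returns an array of shape (total_iterations, num_frames) where entry [i, f]
    is how many denoising steps frame f has completed after iteration i.
    Frame f starts `lag * f` iterations after frame 0, so earlier frames stay cleaner."""
    total = num_steps + lag * (num_frames - 1)
    offsets = lag * np.arange(num_frames)
    steps = np.arange(1, total + 1)[:, None] - offsets[None, :]
    return np.clip(steps, 0, num_steps)

print(asynchronous_schedule(num_frames=4, num_steps=6, lag=2))
```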
Paper and project links
PDF Project page: https://readportrait.github.io/READ/
Summary
Diffusion models have brought major advances to audio-driven talking head generation, but their very slow inference limits practical use. This work proposes READ, a real-time diffusion-transformer-based talking head generation framework. A temporal VAE first learns a spatiotemporally highly compressed video latent space, greatly reducing the token count to accelerate generation. To improve audio-visual alignment in this compressed space, a pre-trained Speech Autoencoder (SpeechAE) generates temporally compressed speech latent codes matched to the video latents, which are then modeled by an Audio-to-Video Diffusion Transformer (A2V-DiT) backbone for efficient talking head synthesis. To keep extended generation temporally consistent and fast, a novel asynchronous noise scheduler (ANS) is used in both training and inference, combining asynchronous add-noise with asynchronous motion-guided generation in latent space. Experiments show that READ generates competitive talking head videos with greatly reduced runtime, striking an optimal balance between quality and speed while keeping metrics stable in long-duration generation.
Key Takeaways
- Diffusion models have significantly advanced audio-driven talking head generation.
- Slow diffusion inference limits practical deployment.
- READ is a real-time diffusion-transformer framework for talking head generation.
- A temporal VAE learns a highly compressed video latent space to speed up generation.
- A pre-trained Speech Autoencoder (SpeechAE) produces temporally compressed speech latent codes aligned with the video latent space.
- An asynchronous noise scheduler (ANS) is used in both training and inference to ensure temporal consistency and faster extended generation.
Click here to view paper screenshots
MetaTT: A Global Tensor-Train Adapter for Parameter-Efficient Fine-Tuning
Authors:Javier Lopez-Piqueres, Pranav Deshpande, Archan Ray, Mattia J. Villani, Marco Pistoia, Niraj Kumar
We present MetaTT, a Tensor Train (TT) adapter framework for fine-tuning of pre-trained transformers. MetaTT enables flexible and parameter-efficient model adaptation by using a single shared TT to factorize transformer sub-modules. This factorization indexes key structural dimensions, including layer and matrix type, and can optionally incorporate heads and tasks. This design allows MetaTT’s parameter count to scale with the sum, rather than the product, of the modes, resulting in a substantially more compact adapter. Our benchmarks compare MetaTT with LoRA along with recent state-of-the-art matrix and tensor decomposition based fine-tuning methods. We observe that when tested on single-task standard language modeling benchmarks, MetaTT achieves competitive parameter efficiency to accuracy tradeoff. We further demonstrate that MetaTT performs competitively when compared to state-of-the-art methods on multi-task learning. Finally, we leverage the TT-ansatz to design a rank adaptive optimizer inspired by the DMRG method from many-body physics. Our results demonstrate that integrating this approach with AdamW enhances optimization performance for a specified target rank.
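The compactness claim, parameters scaling with the sum rather than the product of the modes, is easy to check numerically. The sketch below compares a dense tensor over illustrative adapter modes with a uniform-rank tensor-train factorization; the mode sizes and rank are assumptions for illustration, not MetaTT's actual configuration.

```python
def full_tensor_params(modes):
    """Parameters of a dense tensor over all modes: the product of mode sizes."""
    p = 1
    for n in modes:
        p *= n
    return p

def tensor_train_params(modes, rank):
    """Parameters of a TT factorization with uniform rank r: one core per mode of
    size r_{k-1} * n_k * r_k (rank 1 at both ends), so the count scales with the
    sum of the modes rather than their product."""
    ranks = [1] + [rank] * (len(modes) - 1) + [1]
    return sum(ranks[k] * modes[k] * ranks[k + 1] for k in range(len(modes)))

# Illustrative modes: (layers, matrix types, d_in, d_out) of an adapter tensor
modes, rank = (24, 4, 768, 768), 8
print(full_tensor_params(modes), tensor_train_params(modes, rank))
```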
Paper and project links
Summary
MetaTT is a Tensor-Train (TT) adapter framework for fine-tuning pre-trained transformers. It factorizes transformer sub-modules with a single shared TT, indexing key structural dimensions such as layer and matrix type (optionally heads and tasks), so the parameter count scales with the sum rather than the product of the modes, yielding a substantially more compact adapter. On single-task standard language modeling benchmarks, MetaTT achieves a competitive parameter-efficiency/accuracy trade-off against LoRA and recent matrix- and tensor-decomposition-based fine-tuning methods, and it remains competitive with state-of-the-art methods on multi-task learning. Finally, the TT ansatz is used to design a rank-adaptive optimizer inspired by the DMRG method from many-body physics; integrating it with AdamW improves optimization for a specified target rank.
Key Takeaways
- MetaTT is a Tensor-Train (TT) adapter framework for flexible fine-tuning of pre-trained transformers.
- MetaTT achieves parameter-efficient adaptation by factorizing transformer sub-modules with a single shared TT.
- The design makes the adapter compact: the parameter count scales with the sum of the modes rather than their product.
- MetaTT shows a strong parameter-efficiency/accuracy trade-off on standard language modeling benchmarks.
- MetaTT is competitive with state-of-the-art methods on multi-task learning.
- A rank-adaptive optimizer designed with the TT ansatz is inspired by the DMRG method from many-body physics.
Click here to view paper screenshots
Lina-Speech: Gated Linear Attention and Initial-State Tuning for Multi-Sample Prompting Text-To-Speech Synthesis
Authors:Théodor Lemerle, Téo Guichoux, Axel Roebel, Nicolas Obin
Neural codec language models, built on transformer architecture, have revolutionized text-to-speech (TTS) synthesis, excelling in voice cloning by treating it as a prefix continuation task. However, their limited context length hinders their effectiveness to short speech samples. As a result, the voice cloning ability is restricted to a limited coverage and diversity of the speaker’s prosody and style. Besides, adapting prosody, accent, or appropriate emotion from a short prefix remains a challenging task. Finally, the quadratic complexity of self-attention limits inference throughput. In this work, we introduce Lina-Speech, a TTS model with Gated Linear Attention (GLA) to replace standard self-attention as a principled backbone, improving inference throughput while matching state-of-the-art performance. Leveraging the stateful property of recurrent architecture, we introduce an Initial-State Tuning (IST) strategy that unlocks the possibility of multiple speech sample conditioning of arbitrary numbers and lengths and provides a comprehensive and efficient strategy for voice cloning and out-of-domain speaking style and emotion adaptation. We demonstrate the effectiveness of this approach for controlling fine-grained characteristics such as prosody and emotion. Code, checkpoints, and demo are freely available: https://github.com/theodorblackbird/lina-speech
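A minimal recurrent sketch of gated linear attention for a single head, with the initial state exposed as the object that Initial-State Tuning would optimize: S_t = g_t * S_{t-1} + k_t^T v_t and o_t = q_t S_t. The scalar per-step gate and the shapes are simplifications of the actual GLA parameterization, shown only to convey the stateful recurrence.

```python
import torch

def gated_linear_attention(q, k, v, g, s0=None):
    """Recurrent form of gated linear attention for one head.
    s0 is the tunable initial state that Initial-State Tuning would optimize
    to condition generation on one or more speech prompts."""
    T, dk = q.shape
    dv = v.shape[1]
    S = torch.zeros(dk, dv) if s0 is None else s0.clone()
    outputs = []
    for t in range(T):
        S = g[t] * S + torch.outer(k[t], v[t])   # gated state update
        outputs.append(q[t] @ S)                 # read-out for step t
    return torch.stack(outputs), S

T, dk, dv = 6, 8, 8
q, k, v = torch.randn(T, dk), torch.randn(T, dk), torch.randn(T, dv)
g = torch.sigmoid(torch.randn(T))                # per-step forget gate in (0, 1)
out, final_state = gated_linear_attention(q, k, v, g, s0=torch.zeros(dk, dv))
print(out.shape)                                 # torch.Size([6, 8])
```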
Paper and project links
PDF Audio-AAAI Workshop, 2026
Summary
Neural codec language models built on transformers have revolutionized text-to-speech (TTS) synthesis and excel at voice cloning by treating it as a prefix continuation task. However, their limited context length hinders their effectiveness with short speech samples, restricting voice cloning to limited coverage and diversity of a speaker's prosody and style, and adapting prosody, accent, or emotion from a short prefix remains challenging. This work introduces Lina-Speech, a TTS model that replaces standard self-attention with Gated Linear Attention (GLA), improving inference throughput while matching state-of-the-art performance. Exploiting the stateful property of the recurrent architecture, an Initial-State Tuning (IST) strategy enables conditioning on multiple speech samples of arbitrary number and length, providing a comprehensive and efficient strategy for voice cloning and for out-of-domain speaking style and emotion adaptation.
Key Takeaways
- Neural codec language models excel at voice cloning in TTS.
- Limited context length restricts effectiveness on short speech samples, limiting the coverage and diversity of voice cloning.
- Adapting prosody, accent, and emotion from a short prefix remains challenging.
- Lina-Speech uses Gated Linear Attention (GLA) to improve inference throughput while matching state-of-the-art performance.
- An Initial-State Tuning (IST) strategy exploits the stateful recurrent architecture to condition on any number of speech samples of arbitrary length.
- Lina-Speech provides a comprehensive, efficient strategy for voice cloning and for out-of-domain speaking style and emotion adaptation.