
TTS


⚠️ All of the summaries below are produced by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on them for serious academic work; they are only meant as an initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-11-05

Two-Timescale Optimization Framework for IAB-Enabled Heterogeneous UAV Networks

Authors:Jikang Deng, Hui Zhou, Mohamed-Slim Alouini

In post-disaster scenarios, the rapid deployment of adequate communication infrastructure is essential to support disaster search, rescue, and recovery operations. To achieve this, uncrewed aerial vehicle (UAV) has emerged as a promising solution for emergency communication due to its low cost and deployment flexibility. However, conventional untethered UAV (U-UAV) is constrained by size, weight, and power (SWaP) limitations, making it incapable of maintaining the operation of a macro base station. To address this limitation, we propose a heterogeneous UAV-based framework that integrates tethered UAV (T-UAV) and U-UAVs, where U-UAVs are utilized to enhance the throughput of cell-edge ground user equipments (G-UEs) and guarantee seamless connectivity during G-UEs’ mobility to safe zones. It is noted that the integrated access and backhaul (IAB) technique is adopted to support the wireless backhaul of U-UAVs. Accordingly, we formulate a two-timescale joint user scheduling and trajectory control optimization problem, aiming to maximize the downlink throughput under asymmetric traffic demands and G-UEs’ mobility. To solve the formulated problem, we proposed a two-timescale multi-agent deep deterministic policy gradient (TTS-MADDPG) algorithm based on the centralized training and distributed execution paradigm. Numerical results show that the proposed algorithm outperforms other benchmarks, including the two-timescale multi-agent proximal policy optimization (TTS-MAPPO) algorithm and MADDPG scheduling method, with robust and higher throughput. Specifically, the proposed algorithm obtains up to 12.2% average throughput gain compared to the MADDPG scheduling method.


Paper and Project Links

PDF

Summary

In post-disaster scenarios, rapidly deploying adequate communication infrastructure is essential for search, rescue, and recovery operations. UAVs, being low-cost and flexible to deploy, are a promising solution for emergency communication, but conventional untethered UAVs (U-UAVs) are constrained by size, weight, and power (SWaP) and cannot sustain a macro base station. To address this, the authors propose a heterogeneous UAV framework that combines a tethered UAV (T-UAV) with U-UAVs: the U-UAVs boost the throughput of cell-edge ground user equipments (G-UEs) and guarantee seamless connectivity while G-UEs move toward safe zones, with integrated access and backhaul (IAB) providing the U-UAVs' wireless backhaul. A two-timescale joint user-scheduling and trajectory-control problem is formulated to maximize downlink throughput under asymmetric traffic demands and G-UE mobility, and a two-timescale multi-agent deep deterministic policy gradient (TTS-MADDPG) algorithm based on centralized training and distributed execution is proposed to solve it. Numerical results show that the algorithm outperforms benchmarks, including the two-timescale multi-agent proximal policy optimization (TTS-MAPPO) algorithm and a MADDPG scheduling method, with more robust and higher throughput, achieving up to a 12.2% average throughput gain over the MADDPG scheduling method.

Key Takeaways

  1. In post-disaster scenarios, rapidly deploying communication infrastructure is critical for search, rescue, and recovery operations.
  2. UAVs are an attractive emergency-communication solution thanks to their low cost and deployment flexibility.
  3. Conventional untethered UAVs (U-UAVs) are limited by SWaP constraints and cannot sustain the operation of a macro base station.
  4. A heterogeneous UAV framework combining a tethered UAV (T-UAV) with U-UAVs is proposed to guarantee seamless connectivity and boost cell-edge G-UE throughput.
  5. Integrated access and backhaul (IAB) supports the wireless backhaul of the U-UAVs.
  6. A two-timescale joint user-scheduling and trajectory-control problem is formulated to maximize downlink throughput (a toy two-timescale loop is sketched below).
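To make the two-timescale structure concrete, the following toy Python sketch interleaves slow trajectory-control updates with per-slot user scheduling. The throughput model, the random placeholder policies, and all constants are assumptions for illustration only; the paper instead learns both decisions with TTS-MADDPG agents.

```python
import random

LONG_TS = 10          # slots per long-timescale frame (trajectory updates)
NUM_SLOTS = 100
NUM_UES = 6

def move_uav(position):
    """Slow-timescale action: pick a new waypoint (random placeholder policy)."""
    return (position[0] + random.uniform(-50, 50), position[1] + random.uniform(-50, 50))

def schedule_ue(position, slot):
    """Fast-timescale action: pick one G-UE to serve this slot (random placeholder)."""
    return random.randrange(NUM_UES)

def throughput(position, ue):
    """Placeholder downlink rate model that decays with distance to a fixed UE grid."""
    ue_pos = (100 * ue, 50)
    d2 = (position[0] - ue_pos[0]) ** 2 + (position[1] - ue_pos[1]) ** 2
    return 1e6 / (1.0 + d2)

uav_pos, total = (0.0, 0.0), 0.0
for slot in range(NUM_SLOTS):
    if slot % LONG_TS == 0:            # slow timescale: trajectory control
        uav_pos = move_uav(uav_pos)
    ue = schedule_ue(uav_pos, slot)    # fast timescale: user scheduling
    total += throughput(uav_pos, ue)

print(f"average throughput over {NUM_SLOTS} slots: {total / NUM_SLOTS:.1f}")
```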


SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level

Authors:Hitomi Jin Ling Tee, Chaoren Wang, Zijie Zhang, Zhizheng Wu

The evaluation of intelligibility for TTS has reached a bottleneck, as existing assessments heavily rely on word-by-word accuracy metrics such as WER, which fail to capture the complexity of real-world speech or reflect human comprehension needs. To address this, we propose Spoken-Passage Multiple-Choice Question Answering, a novel subjective approach evaluating the accuracy of key information in synthesized speech, and release SP-MCQA-Eval, an 8.76-hour news-style benchmark dataset for SP-MCQA evaluation. Our experiments reveal that low WER does not necessarily guarantee high key-information accuracy, exposing a gap between traditional metrics and practical intelligibility. SP-MCQA shows that even state-of-the-art (SOTA) models still lack robust text normalization and phonetic accuracy. This work underscores the urgent need for high-level, more life-like evaluation criteria now that many systems already excel at WER yet may fall short on real-world intelligibility.


Paper and Project Links

PDF

Summary

The paper argues that intelligibility evaluation for TTS has hit a bottleneck: existing assessments rely heavily on word-by-word accuracy metrics such as WER, which neither capture the complexity of real-world speech nor reflect what humans actually need to comprehend. To address this, the authors propose Spoken-Passage Multiple-Choice Question Answering (SP-MCQA), a subjective method that evaluates whether key information in synthesized speech is conveyed accurately, and release SP-MCQA-Eval, an 8.76-hour news-style benchmark dataset. Experiments show that a low WER does not guarantee high key-information accuracy, exposing a gap between traditional metrics and practical intelligibility; even state-of-the-art models still lack robust text normalization and phonetic accuracy. The work calls for higher-level, more life-like evaluation criteria now that many systems already excel at WER yet may fall short on real-world intelligibility.

Key Takeaways

  1. Intelligibility evaluation for TTS has reached a bottleneck because existing methods lean on word-level accuracy (WER) and fail to capture the complexity of real-world speech (a WER sketch follows this list).
  2. SP-MCQA is a new evaluation approach that uses multiple-choice questions to test whether key information in synthesized speech is conveyed accurately.
  3. SP-MCQA-Eval, an 8.76-hour news-style benchmark dataset, is released for SP-MCQA evaluation.
  4. Experiments show that a low WER does not guarantee high key-information accuracy, revealing a gap between traditional metrics and practical intelligibility.
  5. SP-MCQA shows that even state-of-the-art TTS models still struggle with text normalization and phonetic accuracy.
  6. TTS systems need higher-level, more life-like evaluation criteria for real-world performance.
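To illustrate the gap the paper highlights, the sketch below computes standard WER with dynamic programming; the example sentences are made up, and the SP-MCQA side (asking listeners multiple-choice questions about key facts) is not reproduced here. A single substituted word keeps WER low while destroying the key information.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein edit distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# A single misread number yields a low WER but destroys the key information:
ref = "the flight departs at nine fifteen from gate twenty two"
hyp = "the flight departs at five fifteen from gate twenty two"
print(f"WER = {wer(ref, hyp):.2%}")   # 10% WER, yet the departure time is wrong
```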


Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking

Authors:Feng Ju, Zeyu Qin, Rui Min, Zhitao He, Lingpeng Kong, Yi R. Fung

While Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), low diversity in model outputs often becomes a bottleneck; this is partly caused by the common “one problem, one solution” (1P1S) training practice, which provides a single canonical answer and can push models toward a narrow set of reasoning paths. To address this, we propose a “one problem, multiple solutions” (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories and thus increases inference diversity. A core challenge for 1PNS is reliably measuring semantic differences between multi-step chains of thought, so we introduce Reasoning Path Divergence (RPD), a step-level metric that aligns and scores Long Chain-of-Thought solutions to capture differences in intermediate reasoning. Using RPD, we curate maximally diverse solution sets per problem and fine-tune Qwen3-4B-Base. Experiments show that RPD-selected training yields more varied outputs and higher pass@k, with an average +2.80% gain in pass@16 over a strong 1P1S baseline and a +4.99% gain on AIME24, demonstrating that 1PNS further amplifies the effectiveness of TTS. Our code is available at https://github.com/fengjujf/Reasoning-Path-Divergence .


Paper and Project Links

PDF

Summary

To address the low output diversity that bottlenecks test-time scaling (TTS) for large language models, the paper proposes a "one problem, multiple solutions" (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories. The core difficulty of 1PNS is reliably measuring semantic differences between multi-step chains of thought, so the authors introduce Reasoning Path Divergence (RPD), a step-level metric that aligns and scores long chain-of-thought solutions to capture differences in intermediate reasoning. Using RPD to curate maximally diverse solution sets per problem and fine-tune Qwen3-4B-Base, RPD-selected training yields more varied outputs and higher pass@k.

Key Takeaways

  1. Test-time scaling (TTS) improves LLM reasoning but is bottlenecked by low diversity in model outputs.
  2. The common "one problem, one solution" (1P1S) training practice is partly responsible for this problem.
  3. A "one problem, multiple solutions" (1PNS) training paradigm is proposed, exposing the model to a variety of valid reasoning paths to increase inference diversity.
  4. Reasoning Path Divergence (RPD), a step-level metric, is introduced to measure semantic differences among solutions (a simplified sketch follows this list).
  5. RPD-selected training produces more varied outputs and better pass@k results.
  6. Compared with the 1P1S baseline, RPD-curated training improves pass@16 by +2.80% on average.
  7. The code is publicly available on GitHub.
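The paper's exact RPD metric is not reproduced here, but the following simplified sketch conveys the general recipe under stated assumptions: chains of thought are split into steps, step similarity is approximated with token-level Jaccard overlap, divergence is one minus the average best-aligned similarity, and a maximally diverse subset of solutions is picked greedily.

```python
def steps(chain: str) -> list:
    """Split a chain-of-thought into steps and tokenize each step into a word set."""
    return [set(line.lower().split()) for line in chain.strip().splitlines() if line.strip()]

def step_sim(a: set, b: set) -> float:
    """Jaccard similarity between two steps (a stand-in for the paper's step scorer)."""
    return len(a & b) / len(a | b) if a | b else 1.0

def divergence(c1: str, c2: str) -> float:
    """Greedy step alignment; 0 = identical reasoning, 1 = fully divergent."""
    s1, s2 = steps(c1), steps(c2)
    if not s1 or not s2:
        return 1.0
    sims = [max(step_sim(a, b) for b in s2) for a in s1]
    return 1.0 - sum(sims) / len(sims)

def pick_diverse(solutions: list, k: int) -> list:
    """Greedily keep the k solutions that are pairwise most divergent."""
    chosen = [solutions[0]]
    while len(chosen) < min(k, len(solutions)):
        best = max((s for s in solutions if s not in chosen),
                   key=lambda s: min(divergence(s, c) for c in chosen))
        chosen.append(best)
    return chosen

sols = ["add 3 and 4\nresult is 7",
        "count up from 3 four times\nresult is 7",
        "add 4 and 3\nresult is 7"]
print(pick_diverse(sols, 2))   # keeps the two most differently-reasoned solutions
```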


Authors:Davide Romano, Jonathan Schwarz, Daniele Giofré

Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the expense of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming, its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7 reward models, we evaluate both outcome-level (Best-of-$N$) and process-level (tree search) verification under realistic low-$N$ budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when applied across different roles.


Paper and Project Links

PDF Accepted to EMNLP - NLLP Workshop

Summary

Test-time scaling (TTS) can improve LLM performance at the cost of extra computation and latency. While TTS has proven effective in formal domains such as mathematics and programming, its value in argumentative domains such as law remains underexplored. The paper presents an empirical study of verifier-based TTS for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7 reward models, it evaluates both outcome-level (Best-of-N) and process-level (tree search) verification under realistic low-N budgets, and systematically analyzes how verifier utility is affected by domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when the verifiers are applied across different roles.

Key Takeaways

  1. TTS improves LLM performance in formal domains, but its value in argumentative domains such as law has been underexplored.
  2. The paper empirically studies verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks.
  3. A family of 7 reward models is used to evaluate both outcome-level (Best-of-N) and process-level (tree search) verification (a Best-of-N sketch follows this list).
  4. Evaluation is performed under realistic low-N budgets, examining how verifier utility depends on domain specialization, model size, and supervision type.
  5. Verifier utility is affected by properties such as domain specialization, even when verifiers are applied in different roles.
  6. The choice of reward model has a notable impact on verification performance, pointing to directions for future work.
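Outcome-level verification (Best-of-N) is the simplest setting studied: sample N candidate answers and keep the one the reward model scores highest. The sketch below uses placeholder generate and score functions; a real system would sample answers from an LLM and score them with a trained ORM or PRM.

```python
import random
random.seed(0)

def generate_candidate(question: str) -> str:
    """Placeholder for sampling one answer from an LLM."""
    return random.choice(["A", "B", "C", "D"])

def reward_model(question: str, answer: str) -> float:
    """Placeholder outcome reward model (ORM); a real verifier returns a learned score."""
    prior = {"A": 0.2, "B": 0.7, "C": 0.05, "D": 0.05}
    return prior[answer] + random.gauss(0, 0.05)

def best_of_n(question: str, n: int = 8) -> str:
    """Outcome-level verification: sample n answers, keep the highest-scoring one."""
    candidates = [generate_candidate(question) for _ in range(n)]
    return max(candidates, key=lambda a: reward_model(question, a))

print(best_of_n("Which clause governs liability here?", n=8))
```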


Bayesian Speech synthesizers Can Learn from Multiple Teachers

Authors:Ziyang Zhang, Yifan Gao, Xuenan Xu, Baoxiangli, Wen Wu, Chao Zhang

Codec-based text-to-speech (TTS) models have recently gained traction for their efficiency and strong performance in voice cloning. However, codec-based TTS faces limitations due to the challenges of pretraining robust speech codecs and the quality degradation introduced by quantization errors. Emerging evidence suggests that continuous-valued generative models can alleviate these issues and serve as a promising alternative. Yet, effectively modelling diverse speech patterns and developing reliable sampling strategies for continuous-valued autoregressive (AR) TTS remains underexplored. In this work, we propose BELLE, Bayesian evidential learning with language modelling for TTS, a novel continuous-valued AR framework that directly predicts mel-spectrograms from textual input. BELLE treats each mel-spectrogram frame as a Gaussian distribution sampled from a learned hyper distribution, enabling principled uncertainty estimation, particularly in scenarios with parallel data (i.e., one text-audio prompt paired with multiple speech samples). To obtain such data, diverse speech samples are synthesized using multiple pre-trained TTS models given the same text-audio prompts, which are distilled into BELLE via Bayesian evidential learning. Experimental results indicate that BELLE demonstrates highly competitive performance compared with the current best open-source TTS models, even though BELLE is trained on a large amount of synthetic data and uses only approximately one-tenth of their training data. Audio samples generated by BELLE are available at https://belletts.github.io/Belle/. The code, checkpoints, and synthetic data will be released after the paper is accepted.


Paper and Project Links

PDF

Summary

Codec-based TTS models are efficient and strong at voice cloning, but they are limited by the difficulty of pretraining robust speech codecs and by quality degradation from quantization errors. Continuous-valued generative models can alleviate these issues, yet modelling diverse speech patterns and designing reliable sampling strategies for continuous-valued autoregressive (AR) TTS remain underexplored. The paper proposes BELLE (Bayesian evidential learning with language modelling), a continuous-valued AR framework that predicts mel-spectrograms directly from text. BELLE treats each mel-spectrogram frame as a Gaussian distribution sampled from a learned hyper distribution, enabling principled uncertainty estimation, particularly with parallel data in which one text-audio prompt is paired with multiple speech samples. Such data are obtained by synthesizing diverse speech with multiple pretrained TTS models for the same prompts and distilling them into BELLE via Bayesian evidential learning. Experiments show that BELLE is highly competitive with the best open-source TTS models despite being trained on large amounts of synthetic data and only about one tenth of their training data.

Key Takeaways

  1. Codec-based TTS models have drawn attention for their efficiency and voice-cloning performance.
  2. Codec-based TTS is limited by the challenge of pretraining robust speech codecs and by quality degradation from quantization errors.
  3. Continuous-valued generative models are a promising alternative that can alleviate these issues.
  4. BELLE is a new continuous-valued AR framework that predicts mel-spectrograms directly from textual input.
  5. Diverse speech samples synthesized by multiple pretrained TTS models for the same text-audio prompts are distilled into BELLE via Bayesian evidential learning.
  6. Experiments show BELLE is highly competitive with the best open-source TTS models while using far less training data.


SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity

Authors:Hanke Xie, Haopeng Lin, Wenxiao Cao, Dake Guo, Wenjie Tian, Jun Wu, Hanlin Wen, Ruixuan Shang, Hongmei Liu, Zhiqi Jiang, Yuepeng Jiang, Wenxi Chen, Ruiqi Yan, Jiale Qian, Yichao Yan, Shunshun Yin, Ming Tao, Xie Chen, Lei Xie, Xinsheng Wang

Recent advances in text-to-speech (TTS) synthesis have significantly improved speech expressiveness and naturalness. However, most existing systems are tailored for single-speaker synthesis and fall short in generating coherent multi-speaker conversational speech. This technical report presents SoulX-Podcast, a system designed for podcast-style multi-turn, multi-speaker dialogic speech generation, while also achieving state-of-the-art performance in conventional TTS tasks. To meet the higher naturalness demands of multi-turn spoken dialogue, SoulX-Podcast integrates a range of paralinguistic controls and supports both Mandarin and English, as well as several Chinese dialects, including Sichuanese, Henanese, and Cantonese, enabling more personalized podcast-style speech generation. Experimental results demonstrate that SoulX-Podcast can continuously produce over 90 minutes of conversation with stable speaker timbre and smooth speaker transitions. Moreover, speakers exhibit contextually adaptive prosody, reflecting natural rhythm and intonation changes as dialogues progress. Across multiple evaluation metrics, SoulX-Podcast achieves state-of-the-art performance in both monologue TTS and multi-turn conversational speech synthesis.


Paper and Project Links

PDF

Summary

The report presents SoulX-Podcast, a TTS system designed for podcast-style multi-turn, multi-speaker dialogue generation that also reaches state-of-the-art performance on conventional TTS tasks. It integrates a range of paralinguistic controls and supports Mandarin, English, and several Chinese dialects, enabling more personalized podcast-style speech. Experiments show that the system can continuously produce over 90 minutes of conversation with stable speaker timbre and smooth speaker transitions, and that speakers exhibit contextually adaptive prosody, reflecting natural rhythm and intonation changes as the dialogue progresses. Across multiple metrics, SoulX-Podcast achieves state-of-the-art performance in both monologue TTS and multi-turn conversational speech synthesis.

Key Takeaways

  1. SoulX-Podcast is designed for podcast-style multi-turn, multi-speaker dialogic speech generation.
  2. It integrates a range of paralinguistic controls and supports Mandarin, English, and several Chinese dialects, including Sichuanese, Henanese, and Cantonese.
  3. It can continuously produce over 90 minutes of conversation with stable speaker timbre and smooth speaker transitions.
  4. Speakers exhibit contextually adaptive prosody.
  5. It achieves state-of-the-art performance in both monologue TTS and multi-turn conversational speech synthesis.
  6. Experimental results demonstrate strong naturalness and coherence.


UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models

Authors:Wenming Tu, Guanrou Yang, Ruiqi Yan, Wenxi Chen, Ziyang Ma, Yipeng Kang, Kai Yu, Xie Chen, Zilong Zheng

Spoken dialogue models currently lack the ability for fine-grained speech style control, a critical capability for human-like interaction that is often overlooked in favor of purely functional capabilities like reasoning and question answering. To address this limitation, we introduce UltraVoice, the first large-scale speech dialogue dataset engineered for multiple fine-grained speech style control. Encompassing over 830 hours of speech dialogues, UltraVoice provides instructions across six key speech stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly enhances their fine-grained speech stylistic controllability without degrading core conversational abilities. Specifically, our fine-tuned models achieve improvements of 29.12-42.33% in Mean Opinion Score (MOS) and 14.61-40.09 percentage points in Instruction Following Rate (IFR) on multi-dimensional control tasks designed in the UltraVoice. Moreover, on the URO-Bench benchmark, our fine-tuned models demonstrate substantial gains in core understanding, reasoning, and conversational abilities, with average improvements of +10.84% on the Basic setting and +7.87% on the Pro setting. Furthermore, the dataset’s utility extends to training controllable Text-to-Speech (TTS) models, underscoring its high quality and broad applicability for expressive speech synthesis. The complete dataset and model checkpoints are available at: https://github.com/bigai-nlco/UltraVoice.


Paper and Project Links

PDF 23 pages, 4 figures

Summary

UltraVoice is a large-scale speech dialogue dataset engineered for fine-grained speech style control. It contains over 830 hours of speech dialogues covering six key stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice substantially improves their fine-grained style controllability without degrading core conversational abilities, and the dataset is also useful for training controllable text-to-speech (TTS) models, underscoring its quality and broad applicability for expressive speech synthesis.

Key Takeaways

  1. UltraVoice is the first large-scale speech dialogue dataset engineered for multiple fine-grained speech style controls.
  2. It contains over 830 hours of speech dialogues covering six key stylistic dimensions.
  3. Fine-tuning models such as SLAM-Omni and VocalNet on UltraVoice markedly improves their fine-grained speech style controllability.
  4. On multi-dimensional control tasks, the fine-tuned models gain 29.12-42.33% in Mean Opinion Score (MOS) and 14.61-40.09 percentage points in Instruction Following Rate (IFR).
  5. The dataset also improves core understanding, reasoning, and conversational abilities, with clear gains on the URO-Bench benchmark.
  6. UltraVoice can further be used to train controllable text-to-speech (TTS) models, demonstrating its quality and broad applicability.


T2SMark: Balancing Robustness and Diversity in Noise-as-Watermark for Diffusion Models

Authors:Jindong Yang, Han Fang, Weiming Zhang, Nenghai Yu, Kejiang Chen

Diffusion models have advanced rapidly in recent years, producing high-fidelity images while raising concerns about intellectual property protection and the misuse of generative AI. Image watermarking for diffusion models, particularly Noise-as-Watermark (NaW) methods, encode watermark as specific standard Gaussian noise vector for image generation, embedding the infomation seamlessly while maintaining image quality. For detection, the generation process is inverted to recover the initial noise vector containing the watermark before extraction. However, existing NaW methods struggle to balance watermark robustness with generation diversity. Some methods achieve strong robustness by heavily constraining initial noise sampling, which degrades user experience, while others preserve diversity but prove too fragile for real-world deployment. To address this issue, we propose T2SMark, a two-stage watermarking scheme based on Tail-Truncated Sampling (TTS). Unlike prior methods that simply map bits to positive or negative values, TTS enhances robustness by embedding bits exclusively in the reliable tail regions while randomly sampling the central zone to preserve the latent distribution. Our two-stage framework then ensures sampling diversity by integrating a randomly generated session key into both encryption pipelines. We evaluate T2SMark on diffusion models with both U-Net and DiT backbones. Extensive experiments show that it achieves an optimal balance between robustness and diversity. Our code is available at \href{https://github.com/0xD009/T2SMark}{https://github.com/0xD009/T2SMark}.


Paper and Project Links

PDF Accepted by NeurIPS 2025

Summary

Diffusion models now produce high-fidelity images, raising concerns about intellectual-property protection and misuse of generative AI. Noise-as-Watermark (NaW) methods encode the watermark as a specific standard Gaussian noise vector used for image generation, embedding the information seamlessly while preserving image quality; detection inverts the generation process to recover the initial noise vector before extraction. Existing NaW methods, however, struggle to balance watermark robustness with generation diversity: some gain robustness by heavily constraining the initial noise sampling, hurting user experience, while others preserve diversity but are too fragile for real-world deployment. The paper proposes T2SMark, a two-stage watermarking scheme based on Tail-Truncated Sampling (TTS). Instead of simply mapping bits to positive or negative values, TTS embeds bits only in the reliable tail regions and samples the central zone randomly to preserve the latent distribution, while the two-stage framework integrates a randomly generated session key into both encryption pipelines to ensure sampling diversity. Experiments on diffusion models with U-Net and DiT backbones show that T2SMark achieves an optimal balance between robustness and diversity.

Key Takeaways

  1. Diffusion models produce high-fidelity images, raising concerns about intellectual property and the misuse of generative AI.
  2. Existing Noise-as-Watermark (NaW) methods struggle to balance watermark robustness with generation diversity.
  3. T2SMark, a two-stage watermarking scheme based on Tail-Truncated Sampling (TTS), is proposed to address this problem (a sketch of tail-truncated embedding follows this list).
  4. TTS embeds bits only in the reliable tail regions while randomly sampling the central zone to preserve the latent distribution.
  5. T2SMark's two-stage framework integrates a randomly generated session key to ensure sampling diversity.
  6. Experiments show that T2SMark balances robustness and diversity across multiple diffusion models.
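A hedged numpy sketch of the tail-truncated sampling idea: watermark bits are carried by the sign of samples drawn from the Gaussian tail, while unused positions keep ordinary Gaussian noise. The threshold, the bit positions, and the omission of the session-key/encryption stages are simplifications, not the paper's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
TAU = 1.0   # assumed tail threshold; watermark bits live in |z| > TAU

def sample_tail(bit: int) -> float:
    """Rejection-sample the Gaussian tail; the sign carries the bit (1 -> positive)."""
    while True:
        z = abs(rng.standard_normal())
        if z > TAU:
            return z if bit == 1 else -z

def embed(bits, latent_size: int) -> np.ndarray:
    """One bit per designated leading position; remaining positions stay plain Gaussian."""
    noise = rng.standard_normal(latent_size)
    for i, b in enumerate(bits):        # positions would be keyed/permuted in the real scheme
        noise[i] = sample_tail(int(b))
    return noise

def extract(noise: np.ndarray, n_bits: int) -> np.ndarray:
    """Recover bits from the sign at the designated positions."""
    return (noise[:n_bits] > 0).astype(int)

bits = rng.integers(0, 2, 32)
latent = embed(bits, 64 * 64)
latent_noisy = latent + 0.3 * rng.standard_normal(latent.size)   # mild distortion
print("bit accuracy:", (extract(latent_noisy, 32) == bits).mean())
```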


SHAP Meets Tensor Networks: Provably Tractable Explanations with Parallelism

Authors:Reda Marzouk, Shahaf Bassan, Guy Katz

Although Shapley additive explanations (SHAP) can be computed in polynomial time for simple models like decision trees, they unfortunately become NP-hard to compute for more expressive black-box models like neural networks - where generating explanations is often most critical. In this work, we analyze the problem of computing SHAP explanations for Tensor Networks (TNs), a broader and more expressive class of models than those for which current exact SHAP algorithms are known to hold, and which is widely used for neural network abstraction and compression. First, we introduce a general framework for computing provably exact SHAP explanations for general TNs with arbitrary structures. Interestingly, we show that, when TNs are restricted to a Tensor Train (TT) structure, SHAP computation can be performed in poly-logarithmic time using parallel computation. Thanks to the expressiveness power of TTs, this complexity result can be generalized to many other popular ML models such as decision trees, tree ensembles, linear models, and linear RNNs, therefore tightening previously reported complexity results for these families of models. Finally, by leveraging reductions of binarized neural networks to Tensor Network representations, we demonstrate that SHAP computation can become efficiently tractable when the network’s width is fixed, while it remains computationally hard even with constant depth. This highlights an important insight: for this class of models, width - rather than depth - emerges as the primary computational bottleneck in SHAP computation.


Paper and Project Links

PDF To appear in NeurIPS 2025

Summary

The paper studies the problem of computing SHAP explanations for Tensor Networks (TNs). Although SHAP can be computed in polynomial time for simple models such as decision trees, it becomes NP-hard for more expressive black-box models such as neural networks. The authors introduce a general framework for computing provably exact SHAP explanations for TNs with arbitrary structure, and show that when the TN is restricted to a Tensor Train (TT) structure, SHAP can be computed in poly-logarithmic time using parallel computation. Thanks to the expressiveness of TTs, this complexity result extends to many popular ML models, including decision trees, tree ensembles, linear models, and linear RNNs, tightening previously reported results. Finally, via reductions of binarized neural networks to tensor-network representations, they show that SHAP computation becomes tractable when the network width is fixed, while it remains hard even at constant depth, so width rather than depth is the primary computational bottleneck.

Key Takeaways

  1. Computing SHAP explanations is NP-hard for expressive black-box models such as neural networks (an exact brute-force sketch for tiny models follows this list).
  2. A general framework is introduced for computing provably exact SHAP explanations for Tensor Networks (TNs).
  3. When TNs are restricted to the Tensor Train (TT) structure, SHAP can be computed in poly-logarithmic time, and this complexity result generalizes to many other popular ML models.
  4. Via tensor-network representations, SHAP computation for binarized neural networks becomes tractable when the width is fixed.
  5. The bottleneck of SHAP computation is the model's width rather than its depth.
  6. The TT structure lends itself to parallel computation.
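For intuition on why tractable structure matters: the Shapley value of a feature averages its marginal contribution over all feature subsets, which is exponential to enumerate in general. The brute-force sketch below is exact but only feasible for tiny models; the toy model and the zero baseline are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial

def shap_values(model, x, baseline):
    """Exact Shapley values by enumerating all feature subsets (tiny d only)."""
    d = len(x)
    def value(S):
        masked = [x[j] if j in S else baseline[j] for j in range(d)]
        return model(masked)
    phi = []
    for i in range(d):
        total = 0.0
        others = [j for j in range(d) if j != i]
        for k in range(d):
            for S in combinations(others, k):
                w = factorial(k) * factorial(d - k - 1) / factorial(d)
                total += w * (value(set(S) | {i}) - value(set(S)))
        phi.append(total)
    return phi

# Toy model with an interaction term; baseline of zeros.
model = lambda z: 2 * z[0] + z[1] * z[2]
x, baseline = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
print(shap_values(model, x, baseline))   # efficiency: the values sum to f(x) - f(baseline)
```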


Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs

Authors:Xinlu He, Swayambhu Nath Ray, Harish Mallidi, Jia-Hong Huang, Ashwin Bellur, Chander Chandak, M. Maruf, Venkatesh Ravichandran

Unified architectures in multimodal large language models (MLLM) have shown promise in handling diverse tasks within a single framework. In the text-to-speech (TTS) task, current MLLM-based approaches rely on discrete token representations, which disregard the inherently continuous nature of speech and can lead to loss of fine-grained acoustic information. In this work, we investigate the TTS within the MLLM paradigm using continuous speech representations. We design a dual-head architecture and implement two complementary training strategies for a robust model. (1) A diffusion head generating continuous speech representations is added on the MLLM, which is on frame-level and strictly autoregressive. (2) The original language model head is retained to preserve multitask capability and to control the start and end of speech synthesis. (3) Masked training is employed to address exposure bias in autoregressive decoding. (4) To stabilize optimization, we propose a two-stage scheme where the LM is frozen in the second stage, ensuring the diffusion head learns from a fixed input distribution. Evaluations on LibriSpeech(PC) test-clean show that our approach achieves state-of-the-art autoregressive performance, with a WER of 1.95%, speaker similarity of 0.54, and UTMOS of 4.00. The two-stage training yields a 46% relative WER reduction over the one-stage training baseline. These results highlight the effectiveness of combining autoregressive modeling with continuous-token diffusion, supported by a two-stage training procedure.


Paper and Project Links

PDF

Summary

Unified architectures in multimodal large language models (MLLMs) show promise for handling diverse tasks in a single framework, but current MLLM-based TTS approaches rely on discrete token representations, which ignore the inherently continuous nature of speech and can lose fine-grained acoustic information. This work studies TTS within the MLLM paradigm using continuous speech representations. The authors design a dual-head architecture with complementary training strategies: a diffusion head generates continuous speech representations frame by frame in a strictly autoregressive fashion; the original language-model head is retained to preserve multitask capability and to control the start and end of speech synthesis; masked training mitigates exposure bias in autoregressive decoding; and a two-stage scheme freezes the LM in the second stage so the diffusion head learns from a fixed input distribution. On LibriSpeech(PC) test-clean the approach achieves state-of-the-art autoregressive performance, with a WER of 1.95%, speaker similarity of 0.54, and UTMOS of 4.00; two-stage training yields a 46% relative WER reduction over the one-stage baseline.

Key Takeaways

  1. Unified multimodal LLM architectures are promising for handling diverse tasks in one framework.
  2. Current MLLM-based TTS methods rely on discrete token representations, which can lose fine-grained acoustic information.
  3. The work adopts continuous speech representations to improve TTS within the MLLM paradigm.
  4. A dual-head architecture combining a diffusion head with the original language-model head is designed for robust TTS.
  5. Masked training is used to address exposure bias in autoregressive decoding.
  6. A two-stage training scheme is proposed to stabilize optimization and reduce the word error rate (a sketch of the freezing step follows this list).
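The two-stage scheme can be made concrete with a small PyTorch sketch: stage 1 updates everything, stage 2 freezes the language-model parameters so the diffusion head trains against a fixed input distribution. The modules and loss here are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the LM backbone and the two heads.
lm_backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2)
lm_head = nn.Linear(64, 1000)                                            # token logits
diffusion_head = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 80))  # mel frames

def make_optimizer(stage: int) -> torch.optim.Optimizer:
    """Stage 1: train everything. Stage 2: freeze the LM, update only the diffusion head."""
    for p in lm_backbone.parameters():
        p.requires_grad_(stage == 1)
    for p in lm_head.parameters():
        p.requires_grad_(stage == 1)
    params = [p for m in (lm_backbone, lm_head, diffusion_head)
              for p in m.parameters() if p.requires_grad]
    return torch.optim.AdamW(params, lr=1e-4)

x = torch.randn(2, 16, 64)                  # dummy hidden inputs
for stage in (1, 2):
    opt = make_optimizer(stage)
    h = lm_backbone(x)
    loss = diffusion_head(h).pow(2).mean()  # placeholder loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```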


Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Authors:Yolo Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Junhua Huang, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu

Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training


Paper and Project Links

PDF Version v1.1

Summary

Video understanding is among the most challenging frontiers of computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. Video Large Multimodal Models (Video-LMMs), which combine visual encoders with powerful decoder-based language models, have shown remarkable capabilities, but the post-training phase that turns them from perception systems into sophisticated reasoning engines remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, covering three pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. It presents a structured taxonomy clarifying the roles, interconnections, and video-specific adaptations of these techniques, addressing challenges such as temporal localization, spatiotemporal grounding, long-video efficiency, and multimodal evidence integration; synthesizes key design principles, insights, and evaluation protocols; identifies open challenges in reward design, scalability, and cost-performance optimization; and curates essential benchmarks, datasets, and metrics. The goal is a unified framework for advancing Video-LMM capabilities.

Key Takeaways

  1. Video understanding poses many challenges, including complex spatiotemporal relationships, long-term dependencies, and multimodal evidence.
  2. Video-LMMs, which integrate visual encoders with language models, show strong capabilities on video understanding tasks.
  3. Post-training is the key phase that turns these models from basic perception systems into sophisticated reasoning engines.
  4. The survey is the first comprehensive review of Video-LMM post-training, covering supervised fine-tuning, reinforcement learning, and test-time scaling.
  5. It explains how these techniques address challenges such as temporal localization, spatiotemporal grounding, long-video efficiency, and multimodal evidence integration.
  6. It synthesizes key design principles, insights, and evaluation protocols, and identifies open challenges in reward design, scalability, and cost-performance optimization.
  7. Essential benchmarks, datasets, and metrics are curated to support rigorous assessment of post-training effectiveness, giving researchers and practitioners a unified framework.


SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation

Authors:Chenyang Le, Bing Han, Jinshun Li, Songyong Chen, Yanmin Qian

Simultaneous Speech Translation (SimulST) enables real-time cross-lingual communication by jointly optimizing speech recognition and machine translation under strict latency constraints. Existing systems struggle to balance translation quality, latency, and semantic coherence, particularly in multilingual many-to-many scenarios where divergent read and write policies hinder unified strategy learning. In this paper, we present SimulMEGA (Simultaneous Generation by Mixture-of-Experts Gating), an unsupervised policy learning framework that combines prefix-based training with a Mixture-of-Experts refiner to learn effective read and write decisions in an implicit manner, without adding inference-time overhead. Our design requires only minimal modifications to standard transformer architectures and generalizes across both speech-to-text and text-to-speech streaming tasks. Through comprehensive evaluation on six language pairs, our 500M parameter speech-to-text model outperforms the Seamless baseline, achieving under 7 percent BLEU degradation at 1.5 seconds average lag and under 3 percent at 3 seconds. We further demonstrate the versatility of SimulMEGA by extending it to streaming TTS with a unidirectional backbone, yielding superior latency quality tradeoffs.


Paper and Project Links

PDF NeurIPS 2025 poster

Summary

SimulMEGA is an unsupervised policy-learning framework for simultaneous speech translation (SimulST) that combines prefix-based training with a Mixture-of-Experts refiner to learn effective read and write decisions implicitly, improving translation quality, latency, and semantic coherence in multilingual many-to-many settings. The design requires only minimal modifications to standard transformer architectures, adds no inference-time overhead, and generalizes to both speech-to-text and text-to-speech streaming tasks. Evaluated on six language pairs, the 500M-parameter speech-to-text model outperforms the Seamless baseline, with under 7% BLEU degradation at 1.5 seconds average lag and under 3% at 3 seconds; extending SimulMEGA to streaming TTS with a unidirectional backbone yields favorable latency-quality trade-offs.

Key Takeaways

  1. SimulMEGA is an unsupervised policy-learning framework for simultaneous speech translation (SimulST).
  2. A Mixture-of-Experts gating mechanism acts as the policy for read and write decisions (a generic read/write loop is sketched below).
  3. SimulMEGA improves translation quality, latency, and semantic coherence.
  4. The approach suits multilingual many-to-many translation scenarios.
  5. Read/write policies are learned implicitly through prefix-based training, with only minimal changes to standard transformer architectures and no inference-time overhead.
  6. On six language pairs the 500M-parameter speech-to-text model outperforms the Seamless baseline, with under 7% BLEU degradation at 1.5 s average lag and under 3% at 3 s; the framework also extends to streaming TTS with a unidirectional backbone, yielding strong latency-quality trade-offs.
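As referenced in the list above, here is a generic threshold-based read/write loop for simultaneous translation. It is not SimulMEGA's Mixture-of-Experts router; the confidence and translation functions are toy placeholders that only show where READ and WRITE decisions enter the streaming loop.

```python
def simultaneous_translate(source_tokens, translate_prefix, confidence, threshold=0.6):
    """READ source tokens until the model is confident enough, then WRITE one target
    token; repeat until the source is consumed and no further output is produced."""
    read, target = 0, []
    while True:
        done_reading = read == len(source_tokens)
        if not done_reading and confidence(source_tokens[:read], target) < threshold:
            read += 1                                            # READ action
            continue
        nxt = translate_prefix(source_tokens[:read], target)     # WRITE action
        if nxt is None and done_reading:
            return target
        if nxt is not None:
            target.append(nxt)
        elif not done_reading:
            read += 1

# Toy word-for-word "model" for demonstration only.
lexicon = {"guten": "good", "morgen": "morning"}
conf = lambda src, tgt: 1.0 if len(src) > len(tgt) else 0.0
step = lambda src, tgt: lexicon[src[len(tgt)]] if len(tgt) < len(src) else None
print(simultaneous_translate(["guten", "morgen"], step, conf))   # ['good', 'morning']
```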


EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering

Authors:Tianxin Xie, Shan Yang, Chenxing Li, Dong Yu, Li Liu

Text-to-speech (TTS) has shown great progress in recent years. However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete emotion labels or a carefully crafted and detailed emotional text prompt, making fine-grained emotion manipulation either inaccessible or unstable. These models also require extensive, high-quality datasets for training. To address these limitations, we propose EmoSteer-TTS, a novel training-free approach, to achieve fine-grained speech emotion control (conversion, interpolation, erasure) by activation steering. We first empirically observe that modifying a subset of the internal activations within a flow matching-based TTS model can effectively alter the emotional tone of synthesized speech. Building on this insight, we then develop a training-free and efficient algorithm, including activation extraction, emotional token searching, and inference-time steering, which can be seamlessly integrated into a wide range of pretrained models (e.g., F5-TTS, CosyVoice2, and E2-TTS). In addition, to derive effective steering vectors, we construct a curated emotional speech dataset with diverse speakers. Extensive experiments demonstrate that EmoSteer-TTS enables fine-grained, interpretable, and continuous control over speech emotion, outperforming the state-of-the-art (SOTA). To the best of our knowledge, this is the first method that achieves training-free and continuous fine-grained emotion control in TTS. Demo samples are available at https://emosteer-tts-demo.pages.dev/.


Paper and Project Links

PDF 25 pages, 9 figures, 3 tables

Summary

Most existing TTS systems offer only coarse, rigid emotion control. The paper proposes EmoSteer-TTS, a training-free approach that achieves fine-grained speech emotion control (conversion, interpolation, erasure) via activation steering. The authors observe that modifying a subset of internal activations in a flow-matching-based TTS model can effectively alter the emotional tone of synthesized speech, and build a training-free, efficient algorithm consisting of activation extraction, emotional token searching, and inference-time steering that integrates seamlessly into a range of pretrained models (e.g., F5-TTS, CosyVoice2, E2-TTS). A curated emotional speech dataset with diverse speakers is constructed to derive effective steering vectors. Experiments show that EmoSteer-TTS enables fine-grained, interpretable, and continuous control over speech emotion, outperforming the state of the art.

Key Takeaways

  1. Existing TTS systems are limited to coarse, rigid emotion control.
  2. EmoSteer-TTS is a novel, training-free TTS approach that enables fine-grained control of speech emotion.
  3. It alters the emotional tone of synthesized speech by modifying a subset of the TTS model's internal activations (a minimal steering-hook sketch follows this list).
  4. The method comprises activation extraction, emotional token searching, and inference-time steering.
  5. EmoSteer-TTS integrates seamlessly into a range of pretrained models, and a curated emotional speech dataset is built to derive steering vectors.
  6. It achieves fine-grained, interpretable, and continuous emotion control, outperforming the current state of the art.
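Activation steering needs no training: a direction vector is added to selected internal activations at inference time. The minimal PyTorch sketch below uses a forward hook on a placeholder network; the layer choice, the random steering vector, and the strength alpha are assumptions, whereas EmoSteer-TTS derives its vectors from emotional reference speech and applies them inside a flow-matching TTS model.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.GELU(),
                      nn.Linear(32, 32), nn.GELU(),
                      nn.Linear(32, 8))

# Steering vector: in practice derived from activation differences between
# emotional and neutral reference speech; here just a random placeholder.
steer_vec = torch.randn(32)
alpha = 2.0                       # steering strength (controls emotion intensity)

def steering_hook(module, inputs, output):
    """Shift the layer's activations along the emotion direction."""
    return output + alpha * steer_vec

target_layer = model[2]           # steer the second linear layer's output
handle = target_layer.register_forward_hook(steering_hook)

x = torch.randn(4, 16)
steered = model(x)
handle.remove()                   # removing the hook restores the original behavior
plain = model(x)
print((steered - plain).abs().mean())
```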


OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

Authors:Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxiang Wu, Jinqiao Wang, Chengqing Zong, Jiajun Zhang

Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into the LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, pre-training and fine-tuning codes, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at https://casia-lm.github.io/OpenS2S


Paper and Project Links

PDF Technical Report, Update on OpenS2S_v1.5

Summary

Empathetic interaction is a cornerstone of human-machine communication, yet the most capable empathetic large speech language models (LSLMs) are increasingly closed, leaving their architecture, data, and development opaque. The paper presents OpenS2S, a fully open-source, transparent, end-to-end LSLM for empathetic speech interaction. Built on the empathetic speech-to-text model BLSP-Emo, OpenS2S uses a streaming interleaved decoding architecture for low-latency speech generation and an automated data-construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost: large language models generate empathetic content while controllable text-to-speech systems introduce speaker and emotional variation, producing a scalable training corpus with rich paralinguistic diversity and minimal human supervision. The authors release the dataset, model weights, and pre-training and fine-tuning code to support the broader research community and accelerate innovation in empathetic speech systems.

Key Takeaways

  1. Empathetic interaction is central to human-machine communication, requiring the understanding of speech enriched with paralinguistic cues and the generation of emotional, expressive responses.
  2. OpenS2S is a fully open-source, transparent, end-to-end LSLM designed for empathetic speech interaction.
  3. It builds on the BLSP-Emo model and adopts a streaming interleaved decoding architecture for low-latency speech generation.
  4. An automated data-construction pipeline synthesizes diverse, high-quality empathetic dialogues at low cost.
  5. Large language models generate empathetic content while controllable TTS systems introduce speaker and emotional variation.
  6. The resulting training corpus is scalable, rich in paralinguistic diversity, and requires minimal human supervision.


OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions

Authors:Cheng Luo, Jianghui Wang, Bing Li, Siyang Song, Bernard Ghanem

In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task designed to produce synchronized verbal and non-verbal listener feedback online, based on the speaker’s multimodal inputs. OMCRG captures natural dyadic interactions and introduces new challenges in aligning generated audio with listeners’ facial responses. To tackle these challenges, we incorporate text as an intermediate modality to connect audio and facial responses. We propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates accurate multimodal listener responses. OmniResponse leverages a pretrained LLM enhanced with two core components: Chrono-Text Markup, which precisely timestamps generated text tokens, and TempoVoice, a controllable online text-to-speech (TTS) module that outputs speech synchronized with facial responses. To advance OMCRG research, we offer ResponseNet, a dataset of 696 detailed dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors. Comprehensive evaluations on ResponseNet demonstrate that OmniResponse outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality. Our dataset, code, and models are publicly available.


Paper and Project Links

PDF 25 pages, 9 figures

Summary

The paper introduces Online Multimodal Conversational Response Generation (OMCRG), a task that produces synchronized verbal and non-verbal listener feedback online from the speaker's multimodal inputs. To tackle the challenge of aligning generated audio with listeners' facial responses, the authors propose OmniResponse, a multimodal large language model (MLLM) that autoregressively generates accurate multimodal listener responses. OmniResponse builds on a pretrained LLM enhanced with two core components: Chrono-Text Markup, which precisely timestamps generated text tokens, and TempoVoice, a controllable online text-to-speech (TTS) module that outputs speech synchronized with facial responses. To advance OMCRG research, the authors release ResponseNet, a dataset of 696 detailed dyadic interactions with synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors; evaluations on it show that OmniResponse outperforms baseline models in semantic speech content, audio-visual synchronization, and generation quality.

Key Takeaways

  1. The OMCRG task aims to generate synchronized verbal and non-verbal listener feedback online from a speaker's multimodal inputs.
  2. OmniResponse is a multimodal large language model that autoregressively generates accurate multimodal listener responses.
  3. It has two core components, Chrono-Text Markup and TempoVoice, providing precise text-token timestamps and online text-to-speech synchronized with facial responses.
  4. The ResponseNet dataset contains 696 detailed dyadic interactions with synchronized split-screen video, multichannel audio, transcripts, and annotated facial behaviors.
  5. OmniResponse outperforms baselines in semantic speech content, audio-visual synchronization, and generation quality.
  6. The dataset, code, and models are publicly available.


Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling

Authors:Hao Mark Chen, Guanxi Lu, Yasuyuki Okoshi, Zhiwen Mo, Masato Motomura, Hongxiang Fan

Test-time scaling (TTS) has proven effective in enhancing the reasoning capabilities of large language models (LLMs). Verification plays a key role in TTS, simultaneously influencing (1) reasoning performance and (2) compute efficiency, due to the quality and computational cost of verification. In this work, we challenge the conventional paradigms of verification, and make the first attempt toward systematically investigating the impact of verification granularity-that is, how frequently the verifier is invoked during generation, beyond verifying only the final output or individual generation steps. To this end, we introduce Variable Granularity Search (VG-Search), a unified algorithm that generalizes beam search and Best-of-N sampling via a tunable granularity parameter g. Extensive experiments with VG-Search under varying compute budgets, generator-verifier configurations, and task attributes reveal that dynamically selecting g can improve the compute efficiency and scaling behavior. Building on these findings, we propose adaptive VG-Search strategies that achieve accuracy gains of up to 3.1% over Beam Search and 3.6% over Best-of-N, while reducing FLOPs by over 52%. We will open-source the code to support future research.


Paper and Project Links

PDF Accepted at NeurIPS 2025

Summary

Test-time scaling (TTS) improves the reasoning ability of large language models, and verification plays a key role in it, simultaneously affecting reasoning performance and compute efficiency through the quality and cost of the verifier. The paper challenges conventional verification paradigms and makes a first attempt at systematically studying verification granularity, i.e., how frequently the verifier is invoked during generation, beyond verifying only the final output or individual steps. It introduces Variable Granularity Search (VG-Search), a unified algorithm that generalizes beam search and Best-of-N sampling via a tunable granularity parameter g. Experiments under varying compute budgets, generator-verifier configurations, and task attributes show that dynamically selecting g improves compute efficiency and scaling behavior; the proposed adaptive VG-Search strategies gain up to 3.1% accuracy over beam search and 3.6% over Best-of-N while reducing FLOPs by more than 52%. The code will be open-sourced.

Key Takeaways

  1. Test-time scaling (TTS) enhances the reasoning ability of large language models (LLMs).
  2. Verification affects both reasoning performance and compute efficiency in TTS.
  3. Variable Granularity Search (VG-Search) is introduced, tuning how often the verifier is invoked via a granularity parameter g (a simplified sketch follows this list).
  4. Experiments across compute budgets, generator-verifier configurations, and task attributes show that dynamically adjusting g improves compute efficiency and scaling behavior.
  5. Adaptive VG-Search strategies deliver clear accuracy gains while substantially reducing FLOPs.
  6. The work challenges the conventional verification paradigm and offers a new direction for LLM reasoning.
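A simplified sketch of variable-granularity verification: the verifier is called every g generation steps, so small g behaves like step-level beam search and g equal to the full length reduces to Best-of-N over complete outputs. The generator and verifier below are random placeholders, and the real algorithm's budgeting and adaptive selection of g are not modeled.

```python
import random
random.seed(0)

def extend(candidate: list, steps: int) -> list:
    """Placeholder generator: append `steps` random tokens to a partial solution."""
    return candidate + [random.random() for _ in range(steps)]

def verify(candidate: list) -> float:
    """Placeholder verifier score; a real PRM/ORM would score the partial reasoning."""
    return sum(candidate) / max(len(candidate), 1)

def vg_search(g: int, total_steps: int = 12, width: int = 4, keep: int = 2) -> list:
    """Verify every g steps: keep the `keep` best partial candidates, then expand
    each back out to `width` continuations. g = total_steps reduces to Best-of-N."""
    candidates = [[] for _ in range(width)]
    for start in range(0, total_steps, g):
        steps = min(g, total_steps - start)
        candidates = [extend(c, steps) for c in candidates]
        candidates = sorted(candidates, key=verify, reverse=True)[:keep]
        if start + steps < total_steps:                     # re-expand the survivors
            candidates = [list(c) for c in candidates for _ in range(width // keep)]
    return max(candidates, key=verify)

print(len(vg_search(g=3)), len(vg_search(g=12)))   # both produce 12-step outputs
```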


VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Authors:Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and implementing high-performance in both vision and speech tasks remains a significant challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, making near real-time vision and speech interaction. Code has been released at https://github.com/VITA-MLLM/VITA.


Paper and Project Links

PDF NeurIPS 2025 Spotlight, Code 2.4K Stars: https://github.com/VITA-MLLM/VITA

Summary

Recent multimodal large language models (MLLMs) focus mainly on fusing visual and textual modalities and pay less attention to the role of speech in enhancing interaction. Speech, however, is crucial in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains challenging because of fundamental modality differences. The paper proposes a carefully designed multi-stage training methodology that progressively trains an LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. The approach preserves strong vision-language capability while enabling efficient speech-to-speech dialogue without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. Comparisons against state-of-the-art counterparts on image, video, and speech benchmarks show strong visual and speech capabilities with near real-time vision and speech interaction. Code is released at https://github.com/VITA-MLLM/VITA.

Key Takeaways

  1. The role of speech in multimodal dialogue systems has been underemphasized by MLLMs.
  2. Achieving high performance in both vision and speech tasks is challenging because of fundamental modality differences.
  3. A multi-stage training methodology is proposed to progressively teach the LLM both visual and speech information.
  4. The method preserves strong vision-language capability and enables efficient speech-to-speech dialogue without separate ASR and TTS modules.
  5. It significantly accelerates multimodal end-to-end response speed.
  6. Compared with state-of-the-art counterparts, the model shows strong performance on image, video, and speech tasks.


UVIT/AstroSat observation of TW Hya

Authors:Prasanta K. Nayak, Mayank Narang, P. Manoj, D. K. Ojha, Blesson Mathew, T. Baug, S. Chandra, S. Vig, G. Maheswar, U. S. Kamath

The paper demonstrates the spectroscopic and photometric capabilities of the Ultra-Violet Imaging Telescope (UVIT) to study T-Tauri stars (TTSs). We present the first UVIT/Far-UV spectrum of a TTS, TW Hya. Based on C IV line luminosity, we estimated accretion luminosity (0.1 $L_\odot$) and mass accretion rate (2.2 $\times$ $10^{-8} M_\odot /yr$) of TW Hya, and compared these values with the accretion luminosity (0.03 $L_\odot$) and mass accretion rate (0.6 $\times$ $10^{-8} M_\odot /yr$) derived from spectral energy distribution (SED). From the SED, we derive best-fitted parameters for TW Hya: $T_{eff}$ = 3900$\pm$50 K, radius = 1.2$\pm$0.03 $R_\odot$, $\mathrm{log}, g = 4.0$ and equivalent black-body temperatures corresponding to accretion luminosity as 14100$\pm$25 K. The parameters of TW Hya derived from UVIT observations were found to be matched well with the literature. Comparison with IUE spectra also suggests that UVIT can be used to study the spectroscopic variability of young stars. This study proposes leveraging the FUV spectroscopic capabilities of UVIT to contribute to the advancement of upcoming UV spectroscopic missions, including the Indian Spectroscopic Imaging Space Telescope (INSIST).


Paper and Project Links

PDF The paper has not been published yet and a major revision is required

Summary

The paper demonstrates the spectroscopic and photometric capabilities of the Ultra-Violet Imaging Telescope (UVIT) for studying T Tauri stars (TTSs), presenting the first UVIT far-UV spectrum of a TTS, TW Hya. From the C IV line luminosity the authors estimate an accretion luminosity of 0.1 L⊙ and a mass accretion rate of 2.2×10⁻⁸ M⊙/yr, and compare these with the values derived from the spectral energy distribution (SED), 0.03 L⊙ and 0.6×10⁻⁸ M⊙/yr. The TW Hya parameters derived from the UVIT observations agree well with the literature, and comparison with IUE spectra suggests that UVIT can also be used to study the spectroscopic variability of young stars. The study proposes leveraging UVIT's FUV spectroscopic capabilities to support upcoming UV spectroscopy missions, including the Indian Spectroscopic Imaging Space Telescope (INSIST).

Key Takeaways

  1. UVIT has the spectroscopic and photometric capabilities needed to study TTSs.
  2. The first UVIT far-UV spectrum of the TTS TW Hya is presented.
  3. The accretion luminosity and mass accretion rate of TW Hya are estimated from the C IV line luminosity.
  4. The TW Hya parameters derived from UVIT observations match values reported in the literature.
  5. Comparison with archival IUE spectra confirms UVIT's potential for studying the spectroscopic variability of young stars.
  6. UVIT's FUV spectroscopic capabilities are valuable for advancing upcoming UV spectroscopy missions.


