⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on these for serious academic purposes; they are only meant as a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-02-12
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling
Authors:Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, Bowen Zhou
Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: (1) What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and challenging AIME24 tasks, we have the following observations: (1) The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. (2) With our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, while achieving higher inference efficiency. These findings show the significance of adapting TTS strategies to the specific characteristics of each task and model and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.
Paper and project links
Summary
Test-Time Scaling improves large language model performance by spending extra computation at inference. Through a series of experiments, this study shows that the optimal test-time scaling strategy depends on the choice of policy model, process reward model, and problem difficulty. With such a strategy, small policy models can even outperform much larger ones. These findings highlight the importance of tailoring test-time scaling strategies to the characteristics of each task and model, and point to a promising way to strengthen the reasoning abilities of large language models.
Key Takeaways
- Test-Time Scaling (TTS) is a key method for improving Large Language Model (LLM) performance by spending additional computation during inference.
- The optimal TTS strategy depends on the choice of policy model, Process Reward Model (PRM), and problem difficulty; the combination of these factors strongly affects how well a strategy works.
- In some settings, small policy models with an appropriate TTS strategy can outperform much larger models; for example, on MATH-500 a 1B LLM outperforms a 405B LLM.
- Adapting the TTS strategy to the characteristics of each task and model is essential, and offers an effective route to stronger LLM reasoning.
- Experiments show that a well-chosen TTS strategy significantly improves model performance, especially on complex tasks, indicating that TTS is a promising approach.
- TTS not only improves model performance but also raises inference efficiency, a major advantage in practical deployments.
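The simplest form of the test-time scaling in the takeaways above can be sketched as best-of-N sampling: a policy model generates several candidate solutions and a reward model keeps the highest-scoring one. The `policy` and `prm` below are hypothetical toy stand-ins, not the paper's models; the paper's compute-optimal strategy additionally adapts the budget and search method to the policy model, PRM, and problem difficulty.

```python
def best_of_n(generate, score, n):
    """Sample n candidates from a policy and return the one the
    reward model scores highest (simple test-time scaling)."""
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=score)

# Hypothetical stand-ins for a policy LLM and a Process Reward Model:
answers = ["41", "42", "43", "42"]
policy = lambda i: answers[i % len(answers)]   # i-th sampled answer
prm = lambda ans: 1.0 if ans == "42" else 0.0  # toy PRM favoring "42"

best = best_of_n(policy, prm, n=4)
print(best)  # prints 42
```

In the compute-optimal setting the paper studies, both `n` and the search strategy (e.g. best-of-N versus beam-style search guided by the PRM) would be chosen per policy model, PRM, and estimated problem difficulty rather than fixed as here.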



Speech to Speech Translation with Translatotron: A State of the Art Review
Authors:Jules R. Kala, Emmanuel Adetiba, Abdultaofeek Abayom, Oluwatobi E. Dare, Ayodele H. Ifijeh
A cascade-based speech-to-speech translation pipeline has long been considered the benchmark, but it is plagued by many issues, such as the time taken to translate speech from one language to another and compound errors. These issues arise because the cascade approach chains several components: speech recognition, speech-to-text translation, and finally text-to-speech synthesis. Translatotron, a sequence-to-sequence direct speech-to-speech translation model, was designed by Google to address the compound errors associated with the cascade model. Today there are three versions of the model: Translatotron 1, Translatotron 2, and Translatotron 3. The first version was designed as a proof of concept to show that direct speech-to-speech translation was possible; it was found to be less effective than the cascade model but produced promising results. Translatotron 2 was an improved version of Translatotron 1 with results similar to the cascade model, and Translatotron 3, the latest version, outperforms the cascade model in some respects. In this paper, a complete review of speech-to-speech translation is presented, with a particular focus on all versions of the Translatotron models. We also show that Translatotron is the best model to bridge the language gap between African languages and other well-formalized languages.
Paper and project links
PDF 12 pages and 3 figures
Summary
Cascade-based speech-to-speech translation has long served as the benchmark approach, but it suffers from problems such as long translation times and compound errors, which stem from chaining speech recognition, text translation, and speech synthesis. To address the cascade model's compound errors, Google designed Translatotron, a sequence-to-sequence direct speech-to-speech translation model, which now has three versions: Translatotron 1, Translatotron 2, and Translatotron 3. The paper presents a complete review of speech-to-speech translation with a focus on each Translatotron version, and argues that Translatotron is the best model for bridging the gap between African languages and other well-formalized languages.
Key Takeaways
- Cascade-based speech-to-speech translation suffers from long translation times and compound errors.
- Translatotron is a sequence-to-sequence direct speech-to-speech translation model designed to address the cascade model's compound errors.
- There are currently three versions of the model: Translatotron 1, Translatotron 2, and Translatotron 3.
- Translatotron 1, a proof of concept, demonstrated that direct speech-to-speech translation is possible but was less effective than the cascade model.
- Translatotron 2 improves on Translatotron 1 and performs comparably to the cascade model.
- Translatotron 3 outperforms the cascade model in some respects.



BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting
Authors:Mohammad Jahid Ibna Basher, Md Kowsher, Md Saiful Islam, Rabindra Nath Nandi, Nusrat Jahan Prottasha, Mehadi Hasan Menon, Tareq Al Muntasir, Shammur Absar Chowdhury, Firoj Alam, Niloofar Yousefi, Ozlem Ozmen Garibay
This paper introduces BnTTS (Bangla Text-To-Speech), the first framework for Bangla speaker adaptation-based TTS, designed to bridge the gap in Bangla speech synthesis using minimal training data. Building upon the XTTS architecture, our approach integrates Bangla into a multilingual TTS pipeline, with modifications to account for the phonetic and linguistic characteristics of the language. We pre-train BnTTS on a 3.85k-hour Bangla speech dataset with corresponding text labels and evaluate performance in both zero-shot and few-shot settings on our proposed test dataset. Empirical evaluations in few-shot settings show that BnTTS significantly improves the naturalness, intelligibility, and speaker fidelity of synthesized Bangla speech. Compared to state-of-the-art Bangla TTS systems, BnTTS exhibits superior performance in Subjective Mean Opinion Score (SMOS), Naturalness, and Clarity metrics.
Paper and project links
PDF Accepted paper in NAACL 2025
Summary
This paper presents BnTTS, a Bangla text-to-speech system designed to close the gap in Bangla speech synthesis using minimal training data. Built on the XTTS architecture, the system integrates Bangla into a multilingual TTS pipeline with modifications for the language's phonetic and linguistic characteristics. It is pre-trained on 3.85k hours of Bangla speech with corresponding text labels and evaluated in zero-shot and few-shot settings. Results show that in few-shot settings the system markedly improves naturalness, intelligibility, and speaker fidelity, and it outperforms existing Bangla TTS systems on Subjective Mean Opinion Score (SMOS), Naturalness, and Clarity metrics.
Key Takeaways
- BnTTS is the first speaker-adaptation-based TTS framework for Bangla.
- The system aims to close the gap in Bangla speech synthesis using minimal training data.
- It is built on the XTTS architecture, integrating Bangla into a multilingual TTS pipeline.
- The pipeline is modified to account for Bangla's phonetic and linguistic characteristics.
- The model is pre-trained on a Bangla speech dataset with corresponding text labels.
- Evaluations in zero-shot and few-shot settings show high naturalness, intelligibility, and speaker fidelity.




Less is More for Synthetic Speech Detection in the Wild
Authors:Ashi Garg, Zexin Cai, Henry Li Xinyuan, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews
Driven by advances in self-supervised learning for speech, state-of-the-art synthetic speech detectors have achieved low error rates on popular benchmarks such as ASVspoof. However, prior benchmarks do not address the wide range of real-world variability in speech. Are reported error rates realistic in real-world conditions? To assess detector failure modes and robustness under controlled distribution shifts, we introduce ShiftySpeech, a benchmark with more than 3000 hours of synthetic speech from 7 domains, 6 TTS systems, 12 vocoders, and 3 languages. We found that all distribution shifts degraded model performance, and contrary to prior findings, training on more vocoders, speakers, or with data augmentation did not guarantee better generalization. In fact, we found that training on less diverse data resulted in better generalization, and that a detector fit using samples from a single carefully selected vocoder and a single speaker achieved state-of-the-art results on the challenging In-the-Wild benchmark.
Paper and project links
Summary
Driven by advances in self-supervised learning for speech, state-of-the-art synthetic speech detectors achieve low error rates on popular benchmarks such as ASVspoof, but existing benchmarks fail to cover the wide variability of real-world speech. Are reported error rates realistic in real-world conditions? To assess detector failure modes and robustness under controlled distribution shifts, the authors introduce ShiftySpeech, a benchmark with more than 3000 hours of synthetic speech spanning 7 domains, 6 TTS systems, 12 vocoders, and 3 languages. All distribution shifts degraded model performance, and contrary to prior findings, training on more vocoders, more speakers, or with data augmentation did not guarantee better generalization. In fact, training on less diverse data generalized better: a detector fit on samples from a single carefully selected vocoder and a single speaker achieved state-of-the-art results on the challenging In-the-Wild benchmark.
Key Takeaways
- Self-supervised learning has advanced synthetic speech detectors, but their performance under real-world conditions remains to be verified.
- The ShiftySpeech benchmark simulates real-world speech variability to assess detector failure modes and robustness.
- Distribution shifts degrade model performance.
- Contrary to prior work, training on more vocoders, more speakers, or with data augmentation does not guarantee better generalization.
- Models trained on less diverse data can generalize better.
- A detector trained on samples from a single carefully selected vocoder and a single speaker achieved the best results on the challenging In-the-Wild benchmark.






IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
Authors:Wei Deng, Siyi Zhou, Jingchen Shu, Jinchao Wang, Lu Wang
Recently, large language model (LLM) based text-to-speech (TTS) systems have gradually become the mainstream in the industry due to their high naturalness and powerful zero-shot voice cloning capabilities. Here, we introduce the IndexTTS system, which is mainly based on the XTTS and Tortoise models, with some novel improvements. Specifically, in Chinese scenarios, we adopt a hybrid modeling method that combines characters and pinyin, making the pronunciations of polyphonic characters and long-tail characters controllable. We also performed a comparative analysis of Vector Quantization (VQ) and Finite-Scalar Quantization (FSQ) for codebook utilization of acoustic speech tokens. To further enhance the effect and stability of voice cloning, we introduce a conformer-based speech conditional encoder and replace the speech code decoder with BigVGAN2. Compared with XTTS, it has achieved significant improvements in naturalness, content consistency, and zero-shot voice cloning. Compared with popular open-source TTS systems such as Fish-Speech, CosyVoice2, FireRedTTS, and F5-TTS, IndexTTS has a relatively simple training process, more controllable usage, and faster inference speed. Moreover, its performance surpasses that of these systems. Our demos are available at https://index-tts.github.io.
Paper and project links
Summary
LLM-based text-to-speech (TTS) systems have become the industry mainstream thanks to their high naturalness and strong zero-shot voice cloning capabilities. This paper introduces IndexTTS, which builds mainly on the XTTS and Tortoise models with several novel improvements. For Chinese scenarios, it adopts a hybrid character-pinyin modeling method to make the pronunciation of polyphonic and long-tail characters controllable. The authors also compare Vector Quantization (VQ) against Finite-Scalar Quantization (FSQ) for codebook utilization of acoustic speech tokens. To improve the effect and stability of voice cloning, they introduce a Conformer-based speech conditional encoder and replace the speech code decoder with BigVGAN2. Compared with XTTS, IndexTTS achieves significant gains in naturalness, content consistency, and zero-shot voice cloning, and it surpasses popular open-source TTS systems such as Fish-Speech, CosyVoice2, FireRedTTS, and F5-TTS.
Key Takeaways
- LLM-based TTS systems are valued for their high naturalness and zero-shot voice cloning capabilities.
- IndexTTS builds on the XTTS and Tortoise models with novel improvements.
- For Chinese scenarios, hybrid character-pinyin modeling makes the pronunciation of polyphonic and long-tail characters controllable.
- Vector Quantization (VQ) and Finite-Scalar Quantization (FSQ) are compared for acoustic token codebook utilization.
- A Conformer-based speech conditional encoder is introduced and the speech code decoder is replaced with BigVGAN2, improving the effect and stability of voice cloning.
- IndexTTS significantly outperforms XTTS in naturalness, content consistency, and zero-shot voice cloning.
- IndexTTS also surpasses existing open-source TTS systems such as Fish-Speech and CosyVoice2.
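The VQ-vs-FSQ comparison above can be made concrete with a toy finite-scalar-quantization step: each latent dimension is bounded (here via tanh) and rounded to a small, fixed number of levels, so the codebook is implicit in the rounding grid rather than a learned embedding table. This is a generic FSQ sketch under assumed per-dimension level counts, not IndexTTS's actual configuration.

```python
import math

def fsq_quantize(z, levels):
    """Toy finite-scalar quantization: squash each dimension into
    (-1, 1) with tanh, scale by (levels[i] - 1) / 2, round to the
    nearest integer, and rescale back into [-1, 1]."""
    out = []
    for x, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2
        out.append(round(math.tanh(x) * half) / half)
    return out

# Assumed level counts per dimension; implicit codebook size = 7 * 5 * 5 = 175.
codes = fsq_quantize([0.3, -2.0, 5.0], levels=[7, 5, 5])
print(codes)  # every entry lies on the fixed quantization grid in [-1, 1]
```

Because every grid point is reachable by simple rounding, FSQ sidesteps the codebook-collapse and under-utilization issues that learned VQ codebooks can exhibit, which is the usual motivation for such a comparison.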




Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance
Authors:Shehzeen Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Mikyas T. Desta, Roy Fejgin, Rafael Valle, Jason Li
While autoregressive speech token generation models produce speech with remarkable variety and naturalness, their inherent lack of controllability often results in issues such as hallucinations and undesired vocalizations that do not conform to conditioning inputs. We introduce Koel-TTS, a suite of enhanced encoder-decoder Transformer TTS models that address these challenges by incorporating preference alignment techniques guided by automatic speech recognition and speaker verification models. Additionally, we incorporate classifier-free guidance to further improve synthesis adherence to the transcript and reference speaker audio. Our experiments demonstrate that these optimizations significantly enhance target speaker similarity, intelligibility, and naturalness of synthesized speech. Notably, Koel-TTS directly maps text and context audio to acoustic tokens, and on the aforementioned metrics, outperforms state-of-the-art TTS models, despite being trained on a significantly smaller dataset. Audio samples and demos are available on our website.
Paper and project links
Summary
This paper presents Koel-TTS, a suite of enhanced encoder-decoder Transformer TTS models that address the limited controllability of autoregressive speech token generation by incorporating preference alignment guided by automatic speech recognition and speaker verification models. Classifier-free guidance further improves the synthesized speech's adherence to the transcript and reference speaker audio. Experiments show these optimizations significantly improve target speaker similarity, intelligibility, and naturalness. Koel-TTS maps text and context audio directly to acoustic tokens and, on these metrics, outperforms state-of-the-art TTS models despite being trained on a significantly smaller dataset.
Key Takeaways
- Koel-TTS is an enhanced encoder-decoder Transformer TTS model that addresses the lack of controllability in autoregressive generation.
- Preference alignment guided by automatic speech recognition and speaker verification models strengthens performance.
- Classifier-free guidance further improves adherence to the transcript and the reference speaker audio.
- Experiments show significant gains in target speaker similarity, intelligibility, and naturalness.
- Koel-TTS maps text and context audio directly to acoustic tokens.
- On these metrics, Koel-TTS outperforms state-of-the-art TTS models.
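Classifier-free guidance, mentioned in the takeaways above, is commonly implemented by extrapolating from unconditional toward conditional predictions at each decoding step. The sketch below shows that generic logit combination; it is an illustration of the standard technique, not Koel-TTS's exact formulation, and the logit values are made up.

```python
def cfg_logits(cond, uncond, w):
    """Combine conditional and unconditional next-token logits.
    w = 0 -> purely unconditional, w = 1 -> purely conditional,
    w > 1 -> extrapolate past the conditional logits, strengthening
    adherence to the conditioning (transcript / reference audio)."""
    return [u + w * (c - u) for c, u in zip(cond, uncond)]

cond = [2.0, 0.0, -1.0]   # logits given transcript + speaker audio
uncond = [1.0, 1.0, 0.0]  # logits with the conditioning dropped
guided = cfg_logits(cond, uncond, w=2.0)
print(guided)  # [3.0, -1.0, -2.0]: the conditioning-favored token is boosted
```

In practice the unconditional branch is obtained by randomly dropping the conditioning during training, so one model can serve both roles at inference.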





Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding
Authors:Bohan Li, Hankun Wang, Situo Zhang, Yiwei Guo, Kai Yu
The auto-regressive architecture, like GPTs, is widely used in modern Text-to-Speech (TTS) systems. However, it incurs substantial inference time, particularly due to the challenges in the next-token prediction posed by lengthy sequences of speech tokens. In this work, we introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding. Our results show that VADUSA not only significantly improves inference speed but also enhances performance by incorporating draft heads to predict future speech content auto-regressively. Furthermore, the inclusion of a tolerance mechanism during sampling accelerates inference without compromising quality. Our approach demonstrates strong generalization across large datasets and various types of speech tokens.
Paper and project links
PDF Accepted by ICASSP 2025
Summary
Autoregressive architectures like GPTs are widely used in modern text-to-speech (TTS) systems but incur substantial inference time, largely because of next-token prediction over lengthy sequences of speech tokens. VADUSA is one of the first approaches to accelerate autoregressive TTS via speculative decoding. It not only significantly improves inference speed but also enhances performance by adding draft heads that predict future speech content autoregressively. A tolerance mechanism during sampling accelerates inference further without compromising quality, and the approach generalizes well across large datasets and various types of speech tokens.
Key Takeaways
- Autoregressive architectures are widely used in TTS but suffer long inference times due to next-token prediction over lengthy speech token sequences.
- VADUSA is among the first methods to accelerate autoregressive TTS through speculative decoding.
- Draft heads that autoregressively predict future speech content improve performance as well as speed.
- A tolerance mechanism during sampling further accelerates inference without compromising quality.
- VADUSA performs strongly on large datasets.
- VADUSA generalizes well to various types of speech tokens.
- Speculative decoding offers a new route to faster inference in TTS systems.
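The draft-then-verify control flow behind speculative decoding can be sketched generically: a cheap draft proposes several future tokens, the target model verifies them, and the longest agreeing prefix is accepted, so multiple tokens can be committed per target step. The toy below uses greedy acceptance with deterministic stand-in token functions; VADUSA's draft heads and tolerance mechanism are richer than this sketch.

```python
def speculative_decode(target_next, draft_next, prompt, n_tokens, k):
    """Toy speculative decoding: the draft proposes k tokens at a
    time; the target verifies them in order and stops at the first
    disagreement, substituting its own token there. The output is
    always exactly what the target alone would have produced."""
    seq = list(prompt)
    goal = len(prompt) + n_tokens
    while len(seq) < goal:
        # Draft model proposes up to k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # Target model verifies proposals; break on first mismatch.
        for tok in draft:
            expected = target_next(seq)
            seq.append(expected)  # always keep the target's token
            if expected != tok or len(seq) == goal:
                break
    return seq[len(prompt):]

# Stand-ins: the "target" continues 0, 1, 2, ...; a perfect draft agrees.
target = lambda s: s[-1] + 1
perfect_draft = lambda s: s[-1] + 1
print(speculative_decode(target, perfect_draft, [0], 5, 3))  # [1, 2, 3, 4, 5]
```

A key property shown by the verification step is that even a bad draft only costs speed, never correctness: with `draft_next = lambda s: s[-1] + 2`, the output is unchanged, just produced with fewer accepted tokens per round.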



