HierSpeech++: All the Amazing Things It Could Do | HackerNoon
Briefly

In this work, we propose HierSpeech++, which achieves human-level, high-quality zero-shot speech synthesis. We introduce an efficient and powerful speech synthesis framework by disentangling semantic modeling, speech synthesis, and speech super-resolution.
Moreover, we achieve this performance using only a small-scale open-source dataset, LibriTTS. In addition, our model has significantly faster inference than recently proposed zero-shot speech synthesis models.
Furthermore, we introduce style prompt replication for voice cloning from a 1-second prompt, and noise-free speech synthesis by adopting a denoised style prompt.
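The summary names style prompt replication without describing its mechanics. As a rough illustration only, the sketch below assumes that "replication" means repeating a short prompt along the time axis so that a style encoder sees a longer input. The function name `replicate_style_prompt`, the frame layout, and the repeat count are all hypothetical and are not taken from the HierSpeech++ code.

```python
from typing import List, Sequence

Frame = List[float]  # one time step of prompt features (assumed layout)


def replicate_style_prompt(frames: Sequence[Frame], n_repeats: int) -> List[Frame]:
    """Repeat a short style prompt n_repeats times along the time axis.

    Assumption: the prompt is a sequence of frames ordered in time, so
    replication is plain concatenation of copies of that sequence.
    """
    if n_repeats < 1:
        raise ValueError("n_repeats must be at least 1")
    # Copy each frame so later edits to the output do not alias the input.
    return [list(frame) for _ in range(n_repeats) for frame in frames]


# Toy prompt: 2 frames with 3 feature values each (illustrative numbers).
short_prompt = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
long_prompt = replicate_style_prompt(short_prompt, n_repeats=5)
```

The replicated prompt carries no new speaker information; under this assumption it only lengthens the input seen by whatever component extracts the style representation.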
In future work, we will extend the model to cross-lingual and emotion-controllable speech synthesis by utilizing pre-trained models.