In experiments comparing zero-shot TTS with prompt lengths from 1 s to 10 s, 1 s prompts failed to support effective speech synthesis, whereas 3 s, 5 s, and 10 s prompts reliably produced robust style transfer.
Short prompts are difficult for two reasons. First, a short slice may contain no voiced speech at all, which leaves the model with incomplete or uninformative input. Second, training emphasizes full-length prompts, so very short prompts are poorly matched to what the model saw during training.
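The risk that a short slice carries little or no speech can be illustrated with a rough check. The sketch below is not taken from the experiments: it uses frame-level RMS energy as a crude stand-in for voicing (a real voicing detector would look at periodicity or F0), and the function name `active_frame_ratio`, the 25 ms frame size, and the -40 dB threshold are all illustrative assumptions.

```python
import numpy as np

def active_frame_ratio(wave, sr, frame_ms=25.0, threshold_db=-40.0):
    """Fraction of frames whose RMS level exceeds threshold_db (dB re full scale).

    Energy is only a proxy for voicing; the threshold and frame size are
    assumed values chosen for illustration.
    """
    wave = np.asarray(wave, dtype=np.float64)
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(wave) // frame_len
    if frame_len == 0 or n_frames == 0:
        return 0.0
    # Drop the trailing partial frame and split into equal-length frames.
    frames = wave[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-10))
    return float(np.mean(level_db > threshold_db))

# Synthetic stand-ins for prompt slices: pure silence vs. a 200 Hz tone.
sr = 16000
silence = np.zeros(sr)
tone = 0.5 * np.sin(2 * np.pi * 200 * np.arange(sr) / sr)
half_and_half = np.concatenate([np.zeros(sr // 2), tone[: sr // 2]])

silence_ratio = active_frame_ratio(silence, sr)
tone_ratio = active_frame_ratio(tone, sr)
mixed_ratio = active_frame_ratio(half_and_half, sr)
```

A 1 s slice that lands in a pause would score near zero on a check like this, which is one way to see why such a prompt gives the model little to work with.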
To address the prompt-length issue, we introduced style prompt replication, which reuses the style prompt by replicating it, in a manner loosely analogous to DNA replication. This makes TTS more robust and aids the synthesis of longer sentences.
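One plausible reading of style prompt replication is sketched below, under stated assumptions: the prompt is treated as a 1-D array (waveform samples or a per-frame feature track) and is tiled along the time axis until it covers a target length. The summary does not say whether replication is applied to the waveform or to features, nor how the target length is chosen, so `replicate_prompt` and `target_len` are hypothetical names for illustration.

```python
import numpy as np

def replicate_prompt(prompt, target_len):
    """Tile a 1-D prompt along time until it spans target_len, then trim.

    Assumption: replication means plain repetition along the time axis.
    """
    prompt = np.asarray(prompt)
    if prompt.ndim != 1 or prompt.size == 0:
        raise ValueError("prompt must be a non-empty 1-D array")
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    repeats = -(-target_len // prompt.size)  # ceiling division
    return np.tile(prompt, repeats)[:target_len]

# Toy example: a 4-sample "prompt" stretched to 10 samples.
short_prompt = np.arange(4)
extended = replicate_prompt(short_prompt, 10)
```

Plain tiling introduces a discontinuity at each seam; whether that matters depends on how the model consumes the prompt, which the summary does not describe.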