Zero-shot Text-to-Speech With Prompts of 1s, 3s, 5s, and 10s

from Hackernoon 10 months ago

In experiments comparing zero-shot TTS with prompt lengths of 1s, 3s, 5s, and 10s, 1s prompts struggled to produce effective speech synthesis, while the longer 3s, 5s, and 10s prompts produced robust style transfer.
Hackernoon: https://hackernoon.com/zero-shot-text-to-speech-with-prompts-of-1s-3s-5s-and-10s

Short prompts are difficult because a short slice may contain no voiced speech at all, leaving the model with incomplete or uninformative input. The problem is compounded by training that emphasizes full-length prompts.
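
To make the slicing problem concrete, here is a minimal sketch that estimates how much of a prompt slice contains any signal. It is not the article's method: the function name `active_fraction`, the 25 ms frame size, and the -40 dB threshold are all assumptions, and frame energy is used only as a rough proxy for voicing.

```python
import numpy as np

def active_fraction(wave, sample_rate, frame_ms=25.0, threshold_db=-40.0):
    """Return the fraction of frames whose RMS level exceeds threshold_db.

    Frame energy is only a crude stand-in for voicing (an assumption of
    this sketch); a real system would use a proper voicing or VAD model.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(wave) // frame_len
    if n_frames == 0:
        return 0.0
    # Drop the incomplete tail and split into fixed-size frames.
    frames = wave[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    level_db = 20 * np.log10(np.maximum(rms, 1e-10))  # avoid log(0)
    return float(np.mean(level_db > threshold_db))

# Synthetic example: 1 s of silence followed by 1 s of a 200 Hz tone.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 200 * t)
wave = np.concatenate([np.zeros(sr), tone])

print(active_fraction(wave[:sr], sr))  # 0.0 (the 1 s slice is all silence)
print(active_fraction(wave, sr))       # 0.5 (half of the 2 s clip is active)
```

In this toy case a 1s slice taken from the start of the clip carries no usable signal, while a longer slice does, which mirrors the failure mode described above.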

To address prompt-length issues, the authors introduce style prompt replication, which makes TTS more robust by reusing the prompt in a manner similar to DNA replication and helps when synthesizing longer sentences.
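
The summary does not say how the replication is carried out, so the following is only a sketch of one plausible reading: tiling the prompt's feature frames along the time axis until they reach a desired length. The function name `replicate_style_prompt`, the `(frames, features)` layout, and the `target_frames` parameter are hypothetical.

```python
import numpy as np

def replicate_style_prompt(prompt, target_frames):
    """Repeat a (frames, features) prompt along time, then trim to target_frames.

    Assumption of this sketch: replication means simple tiling of the
    prompt frames; the original article may do this differently.
    """
    if prompt.ndim != 2 or prompt.shape[0] == 0:
        raise ValueError("prompt must be a non-empty (frames, features) array")
    if target_frames <= 0:
        raise ValueError("target_frames must be positive")
    n_frames = prompt.shape[0]
    repeats = -(-target_frames // n_frames)  # ceiling division
    tiled = np.tile(prompt, (repeats, 1))    # copy the prompt end to end
    return tiled[:target_frames]

# Toy prompt with 3 frames of 2 features, extended to 8 frames.
prompt = np.arange(6, dtype=np.float32).reshape(3, 2)
extended = replicate_style_prompt(prompt, 8)
print(extended.shape)  # (8, 2)
```

Tiling keeps every original frame intact and adds no new information; it only lengthens the conditioning input, which is the sense in which it resembles copying.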


#speech-synthesis #zero-shot-tts #prompt-length #style-transfer #neural-models

