Optimizing XTTSv2 Cloning with Multiple Audio Tracks: Speed vs. Quality Trade-offs and Inference Efficiency #4013
Unanswered
240db asked this question in General Q&A
Replies: 1 comment
-
This is remarkable info. Can you provide short audio samples to demo your findings?
-
I have been using TTS for multilingual work, and I typically rely on XTTSv2. Until recently, I had always used a single audio track as the input speaker_wav, but I realized we could pass multiple audio tracks to improve the cloning during inference. I ran some tests using around 7-10 hours of audio from one speaker and 48 hours from another.
Inference Times
The first thing I noticed is that generation time increased significantly:
My speaker_wav tracks are in .wav format and sampled at 44,100 Hz (44.1 kHz). I plan to try downsampling to 22,050 Hz or similar to see if it shortens inference without sacrificing too much quality.
Results
Despite the longer inference times, the results have been highly encouraging. The model's performance has significantly improved, especially in terms of handling long texts, and the pronunciation keeps getting more robust as I feed it more hours of reference audio.
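The downsampling experiment mentioned under Inference Times could be prepared offline, before any reference audio is handed to the model. Below is a minimal sketch, assuming the tracks are plain PCM WAV files that SciPy can read; the function name `downsample_wav` and the file layout are illustrative and not part of any TTS library. Whether this actually shortens XTTSv2 inference is untested here.

```python
from math import gcd
from pathlib import Path

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly


def downsample_wav(src, dst, target_sr=22_050):
    """Resample one PCM WAV file to target_sr and return the new sample rate.

    Hypothetical helper for illustration; assumes SciPy can read the file.
    """
    sr, audio = wavfile.read(str(src))
    if sr == target_sr:
        # Already at the target rate: copy the samples through unchanged.
        wavfile.write(str(dst), sr, audio)
        return sr

    # resample_poly applies a low-pass filter before decimating,
    # which avoids aliasing when going from 44,100 Hz to 22,050 Hz.
    g = gcd(sr, target_sr)
    resampled = resample_poly(
        audio.astype(np.float64), target_sr // g, sr // g, axis=0
    )

    if np.issubdtype(audio.dtype, np.integer):
        # Round and clip back into the original integer sample range.
        info = np.iinfo(audio.dtype)
        resampled = np.clip(np.rint(resampled), info.min, info.max)
    resampled = resampled.astype(audio.dtype)

    wavfile.write(str(dst), target_sr, resampled)
    return target_sr
```

To convert a whole folder, one could loop over `Path("refs_44k").glob("*.wav")` and write each result into a separate output directory (both folder names are placeholders), then point speaker_wav at the new files.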
Next Steps and Questions
I'm curious whether I can speed up inference by building a model of the speaker_wav files instead of loading all of these files during each generation. Would that be faster? Additionally, since we already build a dataset of transcripts, I wonder whether pronunciation and cloning quality would improve by focusing more on the audio tracks alone.
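One possible approach to the first question is to compute the speaker conditioning once and reuse it across generations. The sketch below assumes the lower-level `Xtts` model class from Coqui TTS is used directly, since it exposes `get_conditioning_latents` and `inference`. The helper names `load_or_compute_latents` and `build_xtts`, the cache file, and all paths are hypothetical. It has not been checked whether cached latents preserve the quality gains reported above.

```python
import pickle
from pathlib import Path


def load_or_compute_latents(model, wav_paths, cache_file):
    """Return (gpt_cond_latent, speaker_embedding), computing them only once.

    Hypothetical helper: `model` is expected to be an Xtts instance.
    Only unpickle cache files you created yourself.
    """
    cache_file = Path(cache_file)
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)

    # The expensive step: every reference file is read and encoded here.
    latents = model.get_conditioning_latents(
        audio_path=[str(p) for p in wav_paths]
    )
    with cache_file.open("wb") as fh:
        pickle.dump(latents, fh)
    return latents


def build_xtts(checkpoint_dir):
    """Load XTTSv2 from a local checkpoint directory (placeholder path)."""
    # Imported lazily so the caching helper can be used without TTS installed.
    from TTS.tts.configs.xtts_config import XttsConfig
    from TTS.tts.models.xtts import Xtts

    config = XttsConfig()
    config.load_json(str(Path(checkpoint_dir) / "config.json"))
    model = Xtts.init_from_config(config)
    model.load_checkpoint(config, checkpoint_dir=str(checkpoint_dir))
    return model


# Example usage (paths and text are placeholders):
# model = build_xtts("/path/to/xtts")
# wavs = sorted(Path("refs").glob("*.wav"))
# gpt_cond_latent, speaker_embedding = load_or_compute_latents(
#     model, wavs, "speaker_a.latents.pkl"
# )
# out = model.inference("Hello there.", "en", gpt_cond_latent, speaker_embedding)
```

With this layout, the reference audio is only processed on the first run; later generations load two small tensors from disk instead of hours of WAV data.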