[Feature request] Add Recipe for all 3 Training stages - XTTS V2 #3704
Comments
Ok so here you go. I picked the code for training from this repo.
Wrote a custom
This trains the DVAE to encode and decode mel-spectrograms. A few things:
The next step would be to fine-tune on a larger dataset. @erogol @eginhard if this is in the right direction, I can convert this into a training recipe. PS: The code is a bit dirty since I have just reused whatever was available as long as it doesn't harm my training. |
I also now understand that the decoder of the DVAE is not used; instead, an LM head on the GPT-2 is used to recompute the mel from the audio codes. I need to understand this a bit better before writing the training code for the next stage. |
Awesome! Amazing! Did you implement the stage 'Finally fine-tune end to end with the Hi-Fi GAN'? |
May I ask a question, haha: to train the DVAE model, is it only necessary to use the features of the audio file? Is text not needed? |
Yes. |
Hey @ScottishFold007, unfortunately no. We have been experimenting with fine-tuning just the GPT-2 model with larger and much more accurately annotated custom datasets. In case you are facing quality issues, my suggestion would be to focus a lot on the dataset; it really helped us drastically improve quality. Particularly:
We are yet to pick up training for the other stages; it's on my to-do list. I deprioritized it a bit since I did not get any response from the repo owners or from anyone who has previously contributed to this, and I did not want to build something that might mislead people by implementing the wrong thing without peer review. |
I must say, you are very meticulous, kudos to you! Hasn't coqui-ai shut down? With no one maintaining it, I'm currently putting the inspiration you provided into practice. With a large amount of data, it still has a significant effect; moreover, training the DVAE is just the first phase. After training is complete, we use the new DVAE model to continue to the second phase, training the GPT model, followed by the third phase, training HiFi-GAN. I think that in the absence of peer review, we could team up to put this into practice, report on progress and any issues that arise, and work together to solve them. I'm not sure if you have WeChat (or any other social media), but I've started some discussion groups to share practical experience and to pioneer together. |
My WeChat: pineking. We can discuss the training questions. |
OK, I've added you. |
@ScottishFold007 @pineking unfortunately I don't use WeChat. Maybe we can connect on Discord? There is this repository, https://github.com/idiap/coqui-ai-TTS, where they are maintaining a new pip package for TTS. I asked the author if they would consider merging something like this, and he said he would if we are able to replicate the TTS model from scratch. Also, I currently have 2-3 projects running, so I'm not sure I will move on this with speed, but I'm happy to connect and contribute in any way I can every now and then. |
@smallsudarshan Hi, thank you for the code, I put everything in one place and made it easier for someone who will want to do a DVAE finetune, |
@daswer123 thanks a lot for picking up the baton! Few things I have observed:
One way to make the model more robust here is to change the training recipe a bit. Currently the ljspeech data loader completely ignores speaker information. During training, the same sample that needs to be synthesized is given to the perceiver. What if, instead, we keep the speaker (and, if applicable, other characteristics like emotion) the same but use a sample with different spoken content? That way, the model might learn that it is the speaker's style that has to be picked up, and it might also work a bit better out of distribution (not sure, though).
If this is to truly work, maybe it needs explicit separate vectors that represent emotion and speaker info? Point 2 is a bit of a deviation from the XTTS architecture, but point 1 seems simple to implement. |
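A minimal sketch of the first idea (condition on a different utterance from the same speaker), not the coqui data loader itself. It assumes each sample is a dict with a hypothetical `speaker` key; the function names are illustrative only.

```python
import random
from collections import defaultdict


def build_speaker_index(samples):
    """Group sample indices by speaker (the 'speaker' key is an assumed field)."""
    index = defaultdict(list)
    for i, sample in enumerate(samples):
        index[sample["speaker"]].append(i)
    return index


def pick_reference(samples, speaker_index, target_idx, rng=random):
    """Return a conditioning sample from the same speaker as the target.

    A different utterance is preferred; if the speaker has only one clip,
    fall back to the target itself (the current behaviour described above).
    """
    speaker = samples[target_idx]["speaker"]
    candidates = [i for i in speaker_index[speaker] if i != target_idx]
    if not candidates:
        return samples[target_idx]
    return samples[rng.choice(candidates)]
```

In a real data loader this choice would be made per item at load time, so the reference clip changes from epoch to epoch.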
@smallsudarshan @daswer123 If you are looking for HiFi-GAN XTTS training code, you can check out this: https://github.com/tuanh123789/Train_Hifigan_XTTS |
@tuanh123789 Wow, thanks. It turns out we can fine-tune each component of XTTS. Can you tell me approximately how fine-tuning will affect the result, and can we train on multiple speakers? And what do you think about a pipeline where we train one voice through all stages, DVAE -> GPT-2 -> HifiGAN? This should give a much better result than fine-tuning GPT-2 alone. |
I experimented with the LJSpeech dataset, both fine-tuning and training from scratch, and the output is very promising. For Vietnamese I used 80 hours. Sure, we can train on multiple speakers. |
One problem with fine-tuning the GPT part: the audio output for short text is very bad. Did you solve it, @daswer123? |
@tuanh123789 Yeah, I noticed that, too. Unfortunately, I haven't found a solution yet. |
@smallsudarshan @daswer123 you never have to train a DVAE. For fine-tuning, only tune GPT-2, plus HiFi-GAN when fine-tuning on larger datasets. The DVAE works for every language; you can even use a pretrained Tortoise DVAE. |
For shorter text it's a data problem: add enough short sentences and it'll work. |
Thanks for the response. After fine-tuning the GPT part with normal data, I used an extra corpus of about 11 hours of short text-audio pairs to fine-tune one more time, but the results did not improve. |
@manmay-nakhashi I'm not really familiar with all the processes and maybe I'm missing something, but why is fine-tuning the DVAE and then passing it to GPT-2 not necessary? Wouldn't pre-training the DVAE on the training dataset give GPT-2 a better view of the dataset? |
@daswer123 the DVAE is universal and can adapt to any language; it just learns how to compress a spectrogram. |
It's true. I implemented a DVAE training pipeline for Vietnamese, but the results are about the same as using the one pretrained on other languages. The short text after fine-tuning GPT is still the problem, though. |
@tuanh123789 it's a data problem: add lots of single words and short sentences. |
Yeah, let's try. |
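To check whether a dataset actually contains enough short utterances, one option is to look at the word-count distribution of the transcripts. This is a rough sketch; the bucket edges are arbitrary assumptions, and whitespace splitting is only a crude proxy for length in languages without spaces.

```python
from collections import Counter


def length_buckets(texts, edges=(1, 3, 10)):
    """Count transcripts per word-count bucket.

    With the default edges the buckets are: at most 1 word, 2-3 words,
    4-10 words, and more than 10 words.
    """
    counts = Counter()
    for text in texts:
        n_words = len(text.split())
        for edge in edges:
            if n_words <= edge:
                counts[f"<={edge}"] += 1
                break
        else:
            # No edge matched, so the transcript is longer than the last edge.
            counts[f">{edges[-1]}"] += 1
    return counts


def bucket_ratios(texts, edges=(1, 3, 10)):
    """Return the share of transcripts that falls into each bucket."""
    counts = length_buckets(texts, edges)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()} if total else {}
```

If the first two buckets are nearly empty, that would be consistent with the short-text failures described in this thread.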
@manmay-nakhashi @tuanh123789 is the DVAE even being used? I checked some time back and I don't think it was. And yes, short text is just a simple data problem. One more problem: I have also seen short audio spikes at the end of speech. I'm not sure how to solve it, but it can probably be post-processed. |
@tuanh123789 did you try mix training? Sequential had not given great results for us. |
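A sketch of what "mix training" could look like at the data level, assuming it means fine-tuning once on a shuffled union of the regular and short-utterance corpora rather than on one after the other. `mix_datasets` and `short_repeat` are hypothetical names, not part of the coqui code.

```python
import random


def mix_datasets(regular, short, short_repeat=1, seed=0):
    """Merge regular and short-utterance samples into one shuffled list.

    short_repeat > 1 oversamples the short clips so they are not drowned
    out by the larger regular corpus.
    """
    mixed = list(regular) + list(short) * short_repeat
    random.Random(seed).shuffle(mixed)  # seeded for reproducible ordering
    return mixed
```

The resulting list would then be passed to the trainer as a single training set.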
Do you use num_workers > 0 in the dataloader? I get those GPU load graphs with DDP (gpu0 purple, gpu1 green; all the other GPUs behave the same). With one GPU and num_workers > 0, things go the same way. In my case it only works with 1 GPU and num_workers=0. It's probably not a hardware problem: Tortoise TTS and some other DDP tunings go well; only coqui's Trainer has those problems. |
Yes, I set num_workers > 0. What hardware do you use? |
6x RTX A6000 48GB, 512GB RAM, 128 AMD cores, fast NVMe SSDs |
Did you use the standard coqui/TTS code to train? |
I use the code provided by coqui. |
Can you please tell me, do you use the same command? maybe it's the problem |
Can you provide the sentence-length ratio in your training dataset? You suggested adding single words during training, but in the code there is a section that removes audio segments shorter than 0.5 s. |
You can reduce that to 0.3, maybe, if you just want to add "hi", "hello", etc. |
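For illustration, a duration filter of the kind discussed here might look like the sketch below. It is not the coqui implementation; the `num_samples` and `sample_rate` keys are assumed fields on each sample.

```python
def filter_by_duration(samples, min_seconds=0.3, max_seconds=None):
    """Keep samples whose audio duration lies within [min_seconds, max_seconds].

    Lowering min_seconds from 0.5 to 0.3 lets single-word clips such as
    "hi" or "hello" survive the filter.
    """
    kept = []
    for sample in samples:
        duration = sample["num_samples"] / sample["sample_rate"]
        if duration < min_seconds:
            continue
        if max_seconds is not None and duration > max_seconds:
            continue
        kept.append(sample)
    return kept
```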
Thank you 🤗 |
Hi @tuanh123789, have you overcome the short text error yet? |
Yes. Add more short sentences, and set min_condition_length in the training config to a smaller value. |
My thanks |
As I understand it, min_condition_length is only related to the reference audio. So how does it address the short text problem? |
Yes. Add more short audio, and this config change will solve the problem. |
Can you provide me with more information about the number of hours of short audio and the specific min_condition_length value to achieve good results? |
Fine-tune the DVAE with your data :D |
@tuanh123789 hi, can you share what changes you made to the training code to enable fine-tuning on Vietnamese data? |
Hello everyone, below is my code for fine-tuning XTTS for a new language. It works well in my case with over 100 hours of audio (even for short text), based on the code by @smallsudarshan |
Man, I can't thank you enough, you just saved me a lot of time!! |
Did anyone try fine-tuning the perceiver sampler? |
Can we conclude that DVAE retraining is not worth it? Can anyone confirm that it had a positive effect? |
Hi @tuanh123789, I am trying to fine-tune xtts-v2 and I'm facing the "short text" problem: the model hallucinates on short text (1-3 words). I added a lot of short-text data to my dataset, but the problem persists. I kept min_conditioning_length in the GPT config at 3 s (the default). Is it NECESSARY to fine-tune the VAE before the GPT model, or does just changing min_conditioning_length to 0.5 s in the GPT config solve the problem? |
Changing the conditioning length breaks the model's voice-cloning ability, but you can try. HiFi-GAN tuning is not necessary. |
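If the conditioning length is specified in audio samples rather than seconds, the values mentioned above can be converted as sketched below. The 22,050 Hz sample rate is an assumption here; check the audio config of your own training run before relying on these numbers.

```python
def seconds_to_samples(seconds, sample_rate=22050):
    """Convert a duration in seconds to a sample count (sample_rate is assumed)."""
    return int(round(seconds * sample_rate))


default_min = seconds_to_samples(3.0)  # the 3 s default mentioned above
reduced_min = seconds_to_samples(0.5)  # the 0.5 s value being considered
```

At 22,050 Hz these come out to 66150 and 11025 samples respectively.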
@tuanh123789 No need to train VAE? |
🚀 Feature Description
Hey, we saw that there is no training code for fine-tuning all parts of XTTS V2. We would like to contribute if it adds value.
The aim can be to make it work very reliably on a particular accent (Indian, for example), in a particular language (English), in a particular speaking style with very little variability. We tried simple fine-tuning, and it seems to learn the accent and the speaking style somewhat, but it is not very robust and mispronounces quite a lot.
Solution
We are not sure if the perceiver needs any fine-tuning.
If licenses permit, we will also share the data.
Does this make sense?