In theory, training a model with phonemes should converge faster. Fine-tuning the model with a new dataset and a new character set might cause samples like the above. If you are fine-tuning, do not touch the model's character set. If you need to change the character set, it is better to train from scratch. If you think it's a bug, please create an issue with a snippet for reproduction and we can take a look at it in due time.
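As a rough illustration of the "do not touch the character set" advice, here is a minimal sketch that compares the symbols of a pretrained model's characters section with those of a fine-tuning config. The dictionary layout (keys "characters", "punctuations", "phonemes"), the sample values, and the helper names are assumptions made for illustration; they are not read from a real config.json.

```python
def vocab_symbols(characters_cfg):
    """Collect every symbol listed in a characters section (assumed layout)."""
    symbols = set()
    for key in ("characters", "punctuations", "phonemes"):
        value = characters_cfg.get(key)
        if value:  # a field may be null (None) in the config
            symbols.update(value)
    return symbols


def compare_vocab(pretrained_cfg, finetune_cfg):
    """Return (missing, extra) symbols of the fine-tuning set vs. the pretrained set."""
    old = vocab_symbols(pretrained_cfg)
    new = vocab_symbols(finetune_cfg)
    return sorted(old - new), sorted(new - old)


# Illustrative values only, not taken from any real model.
pretrained = {"characters": "abc", "punctuations": "!?. ", "phonemes": None}
finetune = {"characters": "abcd", "punctuations": "!?.", "phonemes": None}

missing, extra = compare_vocab(pretrained, finetune)
print("missing from fine-tuning set:", missing)
print("new in fine-tuning set:", extra)
```

If either list is non-empty, the two runs do not share a symbol set. The sketch only compares membership; it does not check symbol order, which may also matter.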
-
I posted this a week ago, but I still have so many question marks in my head.
-
Dear all,
I have been working with Coqui TTS for about two months now, recording my voice and training different models.
The first model I tried was GlowTTS. After some trial and error, I managed to clone my voice by training on the LJSpeech dataset from scratch and then fine-tuning the model on my own dataset. To improve voice quality, I also tried training a vocoder model on the same dataset, but it didn't get any better and I have no idea how to solve that.
That's why I decided to train a VITS model from scratch, where acoustic and vocoder training are combined. My goal here is to train the VITS model with phoneme characters.
First I tried the basic training script for the LJSpeech dataset, train_vits.py, with "use_phonemes=True".
Following the above method, the automatically generated config.json file shows that
I assumed that this characters config is the default setup provided by Coqui.
Thanks to @joachim from the CoquiTTS Matrix channel: https://matrix.to/#/!ABCXMnQJVJnjTbtbIB:gitter.im/$Lrv9S-dWxORCBobVg67mV404qVMMDrRV08k3N2FjUB8?via=gitter.im&via=matrix.org&via=mozilla.org ,
I also noticed that, oddly, "phonemes" is set to null (even though I set 'use_phonemes=True') while "characters" contains phoneme characters. The output audio after 80000 training steps also sounds weird. Below is an example sentence.
"This cake is great. It's so delicious and moist."
Weird, right? It is intelligible in a way, and other people have suggested that I train longer, but I think there's something wrong with the characters (maybe not?).
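One way to test that suspicion, sketched here under assumptions: check whether every symbol in the phonemized text actually exists in the symbol set the model is trained with. The symbol set below is a plain letter set chosen for illustration, and the phoneme string is hand-written, approximate IPA rather than real phonemizer output; `uncovered_symbols` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def uncovered_symbols(phonemized_text, vocab):
    """Return {symbol: count} for symbols in the text that are not in vocab."""
    counts = Counter(phonemized_text)
    return {sym: n for sym, n in counts.items() if sym not in vocab}


# Assumed symbol set: plain letters plus punctuation (illustrative only).
letter_vocab = set("abcdefghijklmnopqrstuvwxyz") | set("!'(),-.:;? ")

# Hand-written, approximate phonemes for "This cake is moist."
phonemized = "ðɪs keɪk ɪz mɔɪst."

missing = uncovered_symbols(phonemized, letter_vocab)
print(missing)
```

Any symbol reported here has no entry in the symbol set, so it cannot be represented as written. An empty result would mean the symbol set at least covers the input; it would not by itself explain poor audio quality.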
So I decided to train again, but this time I set the characters manually in the training script as below. I took this configuration from the config.json of the pretrained LJSpeech VITS model, with "use_phonemes=True".
The result is still very weird after 20000 steps, as you can hear from the link below.
"This cake is great. It's so delicious and moist."
So I trained the model again without phonemes (same setup, but use_phonemes=False), and after 20000 steps the result is very intelligible.
"This cake is great. It's so delicious and moist."
Is this because training a model with phoneme characters takes much longer than training with letters? Or is it just because I did something wrong (perhaps out of ignorance) in the characters config setup?
I think I still don't understand how to match characters and phonemes, and I really don't understand why all the trials with phonemes are bad. Even though I am getting intelligible audio without phonemes, I would like to train the model with phoneme characters for future voice experiments.
I am quite new to this domain and kindly ask for your help. Any small intuition for understanding how letters and phonemes are matched (e.g., which keywords I should look into) would be much appreciated.
Thank you!!!!