Understanding characters and phonemes #2356

hjkaddict · 2023-02-21T11:18:29Z

Dear all,

I have been working with Coqui TTS for about two months now, recording my voice and training different models.
The first model I tried was GlowTTS. After some trial and error, I managed to clone my voice by training on the LJSpeech dataset from scratch and then fine-tuning the model on my own dataset. To improve the voice quality, I also tried training a vocoder model on the same dataset, but the quality did not improve and I have no idea how to fix that.

That's why I decided to train a VITS model from scratch, since it combines acoustic-model and vocoder training in one process. My goal here is to train the VITS model with phoneme characters.

First I tried the basic training script for the LJSpeech dataset, train_vits.py, with "use_phonemes=True".
With this setup, the automatically generated config.json file shows:

"characters": { 
        "characters_class": "TTS.tts.utils.text.characters.IPAPhonemes",
        "vocab_dict": null,
        "pad": "<PAD>",
        "eos": "<EOS>",
        "bos": "<BOS>",
        "blank": "<BLNK>",
        "characters": "iy\u0268\u0289\u026fu\u026a\u028f\u028ae\u00f8\u0258\u0259\u0275\u0264o\u025b\u0153\u025c\u025e\u028c\u0254\u00e6\u0250a\u0276\u0251\u0252\u1d7b\u0298\u0253\u01c0\u0257\u01c3\u0284\u01c2\u0260\u01c1\u029bpbtd\u0288\u0256c\u025fk\u0261q\u0262\u0294\u0274\u014b\u0272\u0273n\u0271m\u0299r\u0280\u2c71\u027e\u027d\u0278\u03b2fv\u03b8\u00f0sz\u0283\u0292\u0282\u0290\u00e7\u029dx\u0263\u03c7\u0281\u0127\u0295h\u0266\u026c\u026e\u028b\u0279\u027bj\u0270l\u026d\u028e\u029f\u02c8\u02cc\u02d0\u02d1\u028dw\u0265\u029c\u02a2\u02a1\u0255\u0291\u027a\u0267\u02b2\u025a\u02de\u026b",
        "punctuations": "!'(),-.:;? ",
        "phonemes": null,
        "is_unique": false,
        "is_sorted": true
    },

I assumed that this characters config is the default setup provided by Coqui.
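
A quick way to see what the generated `characters` field actually holds is to decode its `\u` escapes. The sketch below uses a shortened excerpt of the config above (an assumption for brevity, not the full symbol set) and only the standard library. If the decoded symbols are IPA, that would be consistent with the `IPAPhonemes` class keeping its phoneme set in `characters`, which may be why `phonemes` stays null; treat that reading as an assumption rather than a confirmed explanation.

```python
import json

# Shortened, illustrative excerpt of the generated config (not the full symbol set).
raw = r'''
{
  "characters_class": "TTS.tts.utils.text.characters.IPAPhonemes",
  "characters": "iy\u0268\u0289\u026fu\u026a\u028f\u028ae\u00f8\u0258\u0259",
  "punctuations": "!'(),-.:;? ",
  "phonemes": null
}
'''

cfg = json.loads(raw)

# json.loads turns each \uXXXX escape into the actual Unicode symbol.
symbols = list(cfg["characters"])
non_ascii = [s for s in symbols if ord(s) > 127]

print("decoded characters:", "".join(symbols))
print(len(non_ascii), "of", len(symbols), "symbols are outside ASCII")
print("phonemes field:", cfg["phonemes"])
```

Running the same decode on the full `characters` string from config.json shows whether it is an IPA inventory rather than an alphabet.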

Thanks to @joachim from CoquiTTS Matrix channel: https://matrix.to/#/!ABCXMnQJVJnjTbtbIB:gitter.im/$Lrv9S-dWxORCBobVg67mV404qVMMDrRV08k3N2FjUB8?via=gitter.im&via=matrix.org&via=mozilla.org ,
I also found it weird that "phonemes" is set to null (even though I set 'use_phonemes=True') and "characters" is set to the phoneme characters. Also, the output audio after 80000 training steps sounds weird. Below is an example sentence.
"This cake is great. It's so delicious and moist."

Weird, right? It is somewhat intelligible, and other people have suggested that I train longer, but I think there's something wrong with the characters. (Maybe not?)

So I decided to train again, but this time I set the characters manually in the training script, as below. I took this configuration from the config.json of the pretrained LJSpeech VITS model, and kept "use_phonemes=True".

from TTS.tts.configs.shared_configs import CharactersConfig

character_config = CharactersConfig(
     characters_class="TTS.tts.models.vits.VitsCharacters", 
     pad="_", 
     eos="", 
     bos="", 
     characters="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz", 
     punctuations=";:,.!?¡¿—…\"«»“” ", 
     phonemes="ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ" 
)

The result is still very weird after 20000 steps, as you can hear at the link below.
"This cake is great. It's so delicious and moist."
So I trained the model again without phonemes (the same setup but with use_phonemes=False). After 20000 steps, the result is very intelligible.
"This cake is great. It's so delicious and moist."

Is this because training a model with phoneme characters takes much longer than training with letters? Or is it because I did something wrong (or missed something) in the characters config setup?

I think I still don't understand how characters and phonemes are matched, and I really don't understand why all my attempts with phonemes turn out badly. Even though I get intelligible speech without phonemes, I would like to train the model with phoneme characters for future voice experiments.

I am quite new to this domain and would kindly ask for your help. Any small hint for understanding how letters and phonemes are matched (e.g., which keywords I should look into) would be much appreciated.
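
One way to reason about the matching is to check whether every symbol in a phonemized sentence exists in the configured symbol set. The sketch below is a stand-alone illustration: `out_of_vocab` is a hypothetical helper, the symbol sets are shortened, and the IPA string is hand-written rather than real phonemizer output.

```python
def out_of_vocab(text: str, vocab: str) -> list[str]:
    """Return the symbols in `text` that are absent from `vocab`, in first-seen order."""
    allowed = set(vocab)
    missing = []
    for ch in text:
        if ch not in allowed and ch not in missing:
            missing.append(ch)
    return missing


# Shortened, illustrative symbol sets (assumptions, not the full config above).
letters = "abcdefghijklmnopqrstuvwxyz"
punctuations = ";:,.!? "
phonemes = "ðɪɡɹəʃˈˌː"

# Hand-written IPA for "This cake is great." (not real phonemizer output).
ipa = "ðɪs keɪk ɪz ɡɹeɪt."

# With the phoneme symbols included, nothing should be reported as missing.
print(out_of_vocab(ipa, letters + punctuations + phonemes))

# With letters and punctuation only, the IPA-specific symbols are reported.
print(out_of_vocab(ipa, letters + punctuations))
```

Symbols that the text pipeline emits but the model's vocabulary lacks are one plausible source of garbled output, so running a check like this on real phonemizer output may help narrow things down.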

Thank you!!!!

erogol · 2023-02-23T11:59:57Z

Maintainer

In theory, training a model with phonemes should converge faster. Fine-tuning the model with a new dataset and a new character set might cause samples like the ones above. If you are fine-tuning, do not touch its character set. If you need to change the character set, it is better to train from scratch. If you think it's a bug, please create an issue with a snippet for reproduction and we can take a look at it in due time.
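
Before fine-tuning, it can help to confirm that the character set really is unchanged. The following is a minimal sketch, assuming both configs are plain dicts loaded from their config.json files; `diff_character_sets` and the two tiny configs are hypothetical and only illustrate the comparison.

```python
def diff_character_sets(pretrained: dict, new: dict) -> dict:
    """Return the fields of the `characters` block that differ between two configs."""
    old_chars = pretrained.get("characters") or {}
    new_chars = new.get("characters") or {}
    keys = sorted(set(old_chars) | set(new_chars))
    return {
        key: (old_chars.get(key), new_chars.get(key))
        for key in keys
        if old_chars.get(key) != new_chars.get(key)
    }


# Tiny illustrative configs (assumed shapes, not real checkpoints).
pretrained_cfg = {"characters": {"pad": "_", "characters": "abc", "phonemes": "ðɪ"}}
new_cfg = {"characters": {"pad": "<PAD>", "characters": "abc", "phonemes": None}}

for field, (old_val, new_val) in diff_character_sets(pretrained_cfg, new_cfg).items():
    print(f"{field}: pretrained={old_val!r} new={new_val!r}")
```

An empty result means the two `characters` blocks are identical; any reported field is a difference worth resolving before restoring the checkpoint.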

3 replies

hjkaddict Feb 23, 2023
Author

Thank you for the reply, @erogol. As I mentioned above, I kept failing in the fine-tuning process because of character-matching issues, so I trained the model from scratch. (Maybe the extra explanation in the first paragraph above obscured the point of my question, sorry.)

All the examples above are VITS models trained from scratch on the LJSpeech dataset with the manual characters config and "use_phonemes=True". However, they do not produce good speech. On the other hand, when I set "use_phonemes=False", the result is good. I don't understand why using phonemes results in unintelligible speech.

nanonomad Feb 25, 2023

You may be able to salvage your training by starting a new session with restore and setting config.model_args.reinit_text_encoder=True after changing to phonemes, then retraining the text encoder (TE). For English it typically improves fast, within 10-20k steps. IDK about other languages and character sets.
Back up old ckpts because this doesn't always help, but it often does.
Be sure to set config.model_args.reinit_text_encoder=False again after the run, because if you don't, your config.json will wipe the text encoder every time you initialize the trainer.
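
The last step, switching the flag back off in the saved config.json, can be sketched with the standard library alone. This assumes the flag is stored under a top-level "model_args" object in config.json; `reset_reinit_flag` is a hypothetical helper, not part of Coqui TTS.

```python
import json
from pathlib import Path


def reset_reinit_flag(config_path: str) -> bool:
    """Set model_args.reinit_text_encoder to False; return True if the file changed."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))

    # Assumed layout: the flag lives under the top-level "model_args" object.
    model_args = config.get("model_args", {})
    if not model_args.get("reinit_text_encoder", False):
        return False  # flag absent or already off, leave the file untouched

    model_args["reinit_text_encoder"] = False
    # ensure_ascii=False keeps IPA symbols readable instead of \u escapes.
    path.write_text(json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8")
    return True
```

Calling it on the config.json of the finished run (after backing the file up) should leave every other field as it was.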

hjkaddict Feb 27, 2023
Author

Thank you for your comment, @nanonomad, but I am still confused. If I understand correctly, setting "config.model_args.reinit_text_encoder=True" is needed when I want to train with phoneme characters during fine-tuning? In my case, though, I started the training from scratch with "use_phonemes=True".

Or did you mean restoring the unintelligible model that I already trained with phonemes above and setting "config.model_args.reinit_text_encoder=True"?

hjkaddict · 2023-02-27T15:30:09Z

Author

I posted this a week ago, but I still have so many question marks in my head.
Could anyone provide a sample training script (.py) for an English VITS model that uses phonemes?
That would be much appreciated!

8 replies

abdouaziz Jun 19, 2023

@pas-valkov @ADD-eNavarro try fine-tuning, and make sure your dataset is of good quality.

ADD-eNavarro Jun 19, 2023

@abdouaziz Thanks for your answer. My dataset is good; it's the tux part of M-AILabs. I have tried both re-training and fine-tuning, with no success. I take it from your comment that you did manage to get a working fine-tuned model, right? Care to share the details?
