
Model convergence and inference #4

Open
yangyyt opened this issue Mar 20, 2024 · 11 comments
yangyyt commented Mar 20, 2024

I have a few questions about model training and inference.

  1. Now that training has started, roughly what value should the loss converge to for the model to perform well?
  2. For the inference code, should I refer to eval.ipynb or eval_audio.ipynb? In eval.ipynb I can't find model.tts, model.audio_model, or model.duration_model; for eval_audio.ipynb, do I need to train a vocoder model first and then test?

Thank you very much for your reply.

ex3ndr (Owner) commented Mar 20, 2024

In my experiments the loss stops changing and gets stuck around ~0.3, but I can see the quality keep improving the longer I train. The longest I have trained so far is 400k iterations on two GPUs; I'm not sure what happens beyond that (I expect it to get better).

eval_audio is the up-to-date one; I haven't updated the main eval notebook. The vocoder is pretrained and is provided in the eval notebooks.

yangyyt (Author) commented Mar 20, 2024

My training loss dropped from 2.x to 1.x, and I have trained for 1200+ steps. I only used the LibriTTS data; I don't know if that's normal.

yangyyt (Author) commented Mar 21, 2024

> In my experiments the loss stops changing and gets stuck around ~0.3, but I can see the quality keep improving the longer I train. The longest I have trained so far is 400k iterations on two GPUs; I'm not sure what happens beyond that (I expect it to get better).
>
> eval_audio is the up-to-date one; I haven't updated the main eval notebook. The vocoder is pretrained and is provided in the eval notebooks.

It has dropped to about 0.3 today. I will test it and see how it sounds.

ex3ndr (Owner) commented Mar 21, 2024

I have updated all the code in the eval notebook and also published how-to-use instructions.

yangyyt (Author) commented Mar 21, 2024

> I have updated all the code in the eval notebook and also published how-to-use instructions.

Thanks a lot. I used the eval_audio.ipynb file to test my model and found the results were not as good as yours, so I am going to check my model.

  1. I only used LibriTTS data;
  2. I didn't use the style feature.

Not sure how much impact these two have; I'm going to look into why.

ex3ndr (Owner) commented Mar 21, 2024

Style tokens (which are in fact just normalized pitch) improved emotional prosody a lot. Some of my notebooks have an example of inference without style tokens (I train 10% of the time without them to make that work).
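For readers following along, here is a minimal sketch of that 10% style-token dropout, assuming the style tokens are passed as an optional conditioning tensor; the `training_step` function and the model's `style` argument are illustrative, not the repo's actual API:

```python
import torch
import torch.nn.functional as F

STYLE_DROP_PROB = 0.1  # train roughly 10% of batches without style tokens

def training_step(model, phonemes, style_tokens, target_mel):
    # With probability 0.1, zero out the style (pitch) conditioning so the
    # model also learns to generate when no style tokens are provided.
    if torch.rand(()).item() < STYLE_DROP_PROB:
        style_tokens = torch.zeros_like(style_tokens)

    pred = model(phonemes, style=style_tokens)  # assumed call signature
    return F.mse_loss(pred, target_mel)         # placeholder loss
```

Dropping the conditioning during a fraction of training is what later makes inference without style tokens possible, since the model has already seen zeroed conditioning.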

yangyyt (Author) commented Mar 23, 2024

> Style tokens (which are in fact just normalized pitch) improved emotional prosody a lot. Some of my notebooks have an example of inference without style tokens (I train 10% of the time without them to make that work).

The generation is normal now. It turned out there was something wrong with the input to the audio model's sampling: the log mel spectrogram needs to be normalized (std: 2.1615, mean: -5.8843). But why is this step added? The spectrogram was not normalized when I trained my model. And how are the std and mean calculated?
Another question: how was your voice_x.pt generated?

ex3ndr (Owner) commented Mar 23, 2024

It is normalized during training; these numbers are from the Voicebox paper, though I feel that for my data they should be different, I just haven't been careful about it yet.
voice_x.pt is generated using generate_voices.py from the root of the repo.
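For reference, a minimal sketch of the normalization being discussed, using the constants quoted in this thread; the function names are illustrative and the actual notebook code may differ. Presumably the inverse is applied to the model's output before vocoding, so the vocoder sees spectrograms on the original log mel scale.

```python
# Statistics quoted above (from the Voicebox paper).
MEL_MEAN = -5.8843
MEL_STD = 2.1615

def normalize_mel(log_mel):
    """Standardize a log mel spectrogram before feeding it to the audio model."""
    return (log_mel - MEL_MEAN) / MEL_STD

def denormalize_mel(normalized_mel):
    """Undo the standardization before handing the output to the vocoder."""
    return normalized_mel * MEL_STD + MEL_MEAN
```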

yangyyt (Author) commented Mar 23, 2024

Got it, thank you.

zvorinji commented Apr 7, 2024

@ex3ndr have you thought of using the semantic model from WhisperSpeech?

ex3ndr (Owner) commented Apr 7, 2024

@zvorinji Hey, I am not convinced that Whisper has anything useful; I tried in the past to use its latent outputs to predict the presence of voice, but it turned out that training from scratch was a much easier task. wav2vec would be a more reasonable alternative, but honestly, semantics-wise it is enough to have phonemes with pitch.

What is really missing is emotions and non-semantic information.
