
Model convergence and inference #4

Open
yangyyt opened this issue Mar 20, 2024 · 11 comments
yangyyt commented Mar 20, 2024

I have a few questions about model training and inference.

  1. Now that training has started, roughly what value should the loss converge to for the model to perform well?
  2. For the inference code, should I refer to eval.ipynb or eval_audio.ipynb? In eval.ipynb I can't find model.tts, model.audio_model, or model.duration_model; for eval_audio.ipynb, do I need to train a vocoder model first and then test?

Thank you very much for your reply.

ex3ndr (Owner) commented Mar 20, 2024

In my experiments the loss stops changing and gets stuck around ~0.3, but I can see the quality keep improving the longer I train. The longest I have trained so far is 400k iterations on two GPUs; I'm not sure what happens beyond that (I expect it to get better).

eval_audio is the up-to-date one; I haven't updated the main eval notebook. The vocoder is pretrained and is provided in the eval notebooks.

yangyyt (Author) commented Mar 20, 2024

My training loss dropped from 2.x to 1.x, and I have trained for 1200+ steps. I only used the LibriTTS data; I don't know if that's normal.

yangyyt (Author) commented Mar 21, 2024

> In my experiments the loss stops changing and gets stuck around ~0.3, but I can see the quality keep improving the longer I train. The longest I have trained so far is 400k iterations on two GPUs; I'm not sure what happens beyond that (I expect it to get better).
>
> eval_audio is the up-to-date one; I haven't updated the main eval notebook. The vocoder is pretrained and is provided in the eval notebooks.

It has dropped to about 0.3 today. I will test it and see how it sounds.

ex3ndr (Owner) commented Mar 21, 2024

I have updated all the code in the eval notebook and also published how-to-use instructions.

yangyyt (Author) commented Mar 21, 2024

> I have updated all the code in the eval notebook and also published how-to-use instructions.

Thanks a lot. I used the eval_audio.ipynb file to test my model and found the results were not as good as yours, so I am going to check my model.

  1. I only used LibriTTS data;
  2. I didn't use the style feature.

Not sure how much impact these two have; I'm going to look into why.

ex3ndr (Owner) commented Mar 21, 2024

Style tokens (which are in fact just normalized pitch) improved emotional prosody a lot. Some of my notebooks have an example of inference without style tokens (I train 10% of the time without them to make that work).
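For readers following along, here is a minimal sketch of that 10% style-token dropout, assuming the style tokens are passed as an optional conditioning tensor; the `training_step` function and the model's `style` argument are illustrative, not the repo's actual API:

```python
import torch
import torch.nn.functional as F

STYLE_DROP_PROB = 0.1  # train roughly 10% of batches without style tokens

def training_step(model, phonemes, style_tokens, target_mel):
    # With probability 0.1, zero out the style (pitch) conditioning so the
    # model also learns to generate when no style tokens are provided.
    if torch.rand(()).item() < STYLE_DROP_PROB:
        style_tokens = torch.zeros_like(style_tokens)

    pred = model(phonemes, style=style_tokens)  # assumed call signature
    return F.mse_loss(pred, target_mel)         # placeholder loss
```

Dropping the conditioning during a fraction of training is what later makes inference without style tokens possible, since the model has already seen zeroed conditioning.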

yangyyt (Author) commented Mar 23, 2024

> Style tokens (which are in fact just normalized pitch) improved emotional prosody a lot. Some of my notebooks have an example of inference without style tokens (I train 10% of the time without them to make that work).

The generation is normal now. It turned out there was something wrong with the input to the audio model's sampling: the log mel spectrogram needs to be normalized (std: 2.1615, mean: -5.8843). But why is this step added? The spectrogram was not normalized when I trained my model. And how are the std and mean calculated?
Another question: how was your voice_x.pt generated?

ex3ndr (Owner) commented Mar 23, 2024

It is normalized during training; these numbers are from the Voicebox paper, though I feel that for my data they should be different, I just haven't been careful about it yet.
voice_x.pt is generated using generate_voices.py from the root of the repo.
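For reference, a minimal sketch of the normalization being discussed, using the constants quoted in this thread; the function names are illustrative and the actual notebook code may differ. Presumably the inverse is applied to the model's output before vocoding, so the vocoder sees spectrograms on the original log mel scale.

```python
# Statistics quoted above (from the Voicebox paper).
MEL_MEAN = -5.8843
MEL_STD = 2.1615

def normalize_mel(log_mel):
    """Standardize a log mel spectrogram before feeding it to the audio model."""
    return (log_mel - MEL_MEAN) / MEL_STD

def denormalize_mel(normalized_mel):
    """Undo the standardization before handing the output to the vocoder."""
    return normalized_mel * MEL_STD + MEL_MEAN
```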

yangyyt (Author) commented Mar 23, 2024

Got it, thank you.

zvorinji commented Apr 7, 2024

@ex3ndr have you thought of using the semantic model from WhisperSpeech?

ex3ndr (Owner) commented Apr 7, 2024

@zvorinji Hey, I am not convinced that Whisper has anything useful; I tried in the past to use its latent outputs to predict the presence of voice, but it turned out that training from scratch was a much easier task. wav2vec would be a more reasonable alternative, but honestly, semantics-wise it is enough to have phonemes with pitch.

What is really missing is emotions and non-semantic information.
