Model convergence and inference #4
Comments
In my experiments the loss doesn't change at all and stays stuck at ~
My training loss dropped from 2.x to 1.x after 1200+ steps of training. I only used data from LibriTTS, so I don't know whether that is normal.
It has dropped to about 0.3 today. I will test it to see how it performs.
I have updated all the code in the eval notebook and also published usage instructions.
Thanks a lot. I used the eval_audio.ipynb file to test my model and found that the results were not as good as yours. I am going to check my model.
Style tokens (which are in fact just normalised pitch) improved emotional prosody a lot. One of my notebooks has an example of inference without style tokens (I train 10% of the time without them to make that work).
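For readers following along, training part of the time without style tokens can be sketched roughly as below. This is only an illustration, not the repository's code: the helper name `maybe_drop_style`, the zero fill value, and the per-sample granularity are all assumptions; only the 10% rate comes from the comment above.

```python
import numpy as np


def maybe_drop_style(style_tokens, rng, drop_prob=0.1, fill_value=0.0):
    """Hypothetical helper: with probability `drop_prob`, replace the style
    tokens of one training sample with a constant "no style" placeholder.

    Returns the (possibly replaced) tokens and a flag saying whether they
    were dropped. The zero placeholder is an assumption for illustration.
    """
    style_tokens = np.asarray(style_tokens, dtype=np.float32)
    if rng.random() < drop_prob:
        return np.full_like(style_tokens, fill_value), True
    return style_tokens, False


# Example: a sample with 5 frames of normalised pitch as style tokens.
rng = np.random.default_rng(0)
tokens, dropped = maybe_drop_style([0.3, -0.1, 0.8, 0.0, -1.2], rng)
```

At inference time, the same placeholder would be passed in place of real style tokens when no style reference is available.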
Generation works normally now. It seems something was wrong with the audio sample I fed to the model: the log-mel spectrogram needs to be normalized (std: 2.1615, mean: -5.8843). But why is this step added? The spectrogram was not normalized during model training. And how are std and mean calculated?
It is normalized during training. These numbers are from the Voicebox paper; I feel they should be different for my data, but I have not been careful enough about that yet.
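As a rough sketch of what this normalization looks like and how dataset-specific statistics could be computed: the two constants are the ones quoted in this thread, while the function names and the choice of a single global mean/std over all frames and mel bins are assumptions made for illustration, not necessarily what the repository does.

```python
import numpy as np

# Constants quoted in this thread (attributed above to the Voicebox paper).
MEL_MEAN = -5.8843
MEL_STD = 2.1615


def normalize_mel(log_mel, mean=MEL_MEAN, std=MEL_STD):
    """Shift and scale a log-mel spectrogram to roughly zero mean, unit std."""
    return (np.asarray(log_mel, dtype=np.float64) - mean) / std


def denormalize_mel(norm_mel, mean=MEL_MEAN, std=MEL_STD):
    """Undo normalize_mel, e.g. before passing model output to a vocoder."""
    return np.asarray(norm_mel, dtype=np.float64) * std + mean


def dataset_mel_stats(log_mels):
    """Hypothetical helper: global mean/std over every value of every
    spectrogram in an iterable of (n_mels, n_frames) arrays, accumulated
    in a streaming fashion so the dataset need not fit in memory."""
    count, total, total_sq = 0, 0.0, 0.0
    for mel in log_mels:
        mel = np.asarray(mel, dtype=np.float64)
        count += mel.size
        total += mel.sum()
        total_sq += np.square(mel).sum()
    if count == 0:
        raise ValueError("no spectrograms provided")
    mean = total / count
    var = max(total_sq / count - mean * mean, 0.0)
    return float(mean), float(np.sqrt(var))
```

The same mean and std have to be used for training and inference; feeding an unnormalized spectrogram to a model trained on normalized ones is a mismatch of the kind described above.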
Got it, thank you.
@ex3ndr have you thought of using the semantic model from WhisperSpeech?
@zvorinji hey, I am not convinced that Whisper has anything useful. I tried in the past to use its latent outputs to predict the presence of voice, but it turned out that training from scratch was a much easier task. wav2vec would be a more reasonable alternative, but honestly, semantic-wise it is enough to have phonemes with pitch. What is really missing is emotion and non-semantic information.
Regarding model training and inference, I have a few questions that I would like to ask.
Thank you very much for your reply.