The V2 model is very good, especially since it supports laughs and other non-word sounds, but it's still a little bit behind the best paid models. Are there any plans for a V3 release? If so, can you provide an ETA for its release and what can we expect from it? For example, will it add support for other languages or significantly increase the training data?
Thank you very much for the project.
Data: about 8k hours of Chinese data and 7k hours of data in other languages.
Better zero-shot TTS timbre similarity (speaker-verification distance) and better audio quality (MOS).
Richer emotional expression, with stability (WER) consistent with V2. (A sketch of how these two metrics are typically computed follows below.)
The experiments have been successful, and I am expanding the training data.
Inference time: maybe slightly slower.
The new release is expected around January.
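Since the reply leans on speaker-verification distance and WER as the headline metrics, here is a minimal sketch of how these are typically computed. This is not the project's own evaluation code: the speaker embeddings are assumed to come from some external speaker-verification model, and the function names are hypothetical.

```python
import numpy as np

def timbre_similarity(emb_ref: np.ndarray, emb_syn: np.ndarray) -> float:
    """Cosine similarity between a reference speaker embedding and the
    embedding of the synthesized audio (hypothetical inputs; in practice
    both would come from a speaker-verification model). Higher is more
    similar; the "distance" mentioned above is typically 1 - similarity."""
    a = emb_ref / np.linalg.norm(emb_ref)
    b = emb_syn / np.linalg.norm(emb_syn)
    return float(np.dot(a, b))

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance between an ASR
    transcript of the synthesized audio and the input text, divided by
    the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i, j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    dp[:, 0] = np.arange(len(ref) + 1)
    dp[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i, j] = min(substitution, dp[i - 1, j] + 1, dp[i, j - 1] + 1)
    return dp[len(ref), len(hyp)] / max(len(ref), 1)
```

For Chinese text, the same edit-distance calculation is usually run over characters (CER) rather than whitespace-split words.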