New speaker encoder #447

loganhart02 · 2021-04-22T22:15:11Z

Hey guys! I've been working on improving the multi-speaker models, and I got access to a Spotify podcast dataset (2 TB of podcasts). I haven't really dug into it yet because of the startup I'm founding, so I'm not sure how many speakers there are, but I know these are full-length podcasts, and from what I've listened to they are of good quality too. Do you think training a new speaker encoder on it would improve on the current one and give us a better multi-speaker TTS system? I wanted to ask before diving deep into preprocessing this massive dataset.

erogol (Maintainer) · 2021-04-23T01:32:04Z

I think it'd definitely be valuable for some use cases. The current speaker encoder was trained by @mueller91, so he might be able to comment on it better.

0 replies

mueller91 · 2021-04-26T11:00:32Z

For a good speaker encoder, a high number of different speakers in the dataset is critical. How many speakers does your dataset contain?

9 replies

Edresson May 28, 2021

I think it would be a little difficult to separate the speakers in this dataset, and speaker separation is very important for the loss functions. I would suggest using datasets that already have speaker separation, such as Common Voice.

I am training a new speaker encoder; see PR #508.
I'm training with 53k speakers (Common Voice, all languages + VoxCeleb1 + VoxCeleb2).

The model has been training for 3 days and the results are already better than our current speaker encoder :).
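
To illustrate why speaker separation matters for the loss: speaker-encoder losses are typically computed over batches that contain several utterances from each of several speakers, so every utterance needs a reliable speaker ID. The sketch below is only an illustration under that assumption (the thread does not name the loss in use); `group_by_speaker` and `sample_batch` are hypothetical helpers, not part of the TTS codebase.

```python
import random
from collections import defaultdict


def group_by_speaker(items):
    """Build a mapping speaker_id -> list of utterance paths.

    `items` is an iterable of (speaker_id, utterance_path) pairs.
    """
    by_speaker = defaultdict(list)
    for speaker_id, path in items:
        by_speaker[speaker_id].append(path)
    return dict(by_speaker)


def sample_batch(by_speaker, num_speakers, utts_per_speaker, rng=None):
    """Pick `num_speakers` speakers and `utts_per_speaker` utterances from each.

    Speakers with too few utterances are skipped, which is one reason
    unlabeled or poorly separated data is hard to use for this kind of training.
    """
    rng = rng or random.Random()
    eligible = [s for s, utts in by_speaker.items() if len(utts) >= utts_per_speaker]
    if len(eligible) < num_speakers:
        raise ValueError("not enough speakers with enough utterances")
    chosen = rng.sample(eligible, num_speakers)
    return [(s, rng.sample(by_speaker[s], utts_per_speaker)) for s in chosen]
```

With podcast audio, the `(speaker_id, utterance_path)` pairs would first have to be produced by some diarization step, which is the difficulty described above.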

astricks Jun 14, 2021

@Edresson is the speaker encoder language-independent? For some reason, I was under the impression it is constrained to the English language.

Edresson Jun 14, 2021

@astricks In theory, the speaker encoder should work with similar performance in any language. During training, our speaker encoder saw much more English audio, without any balancing, but it should still work. Training with multiple languages can help (especially given the number of extra speakers and new phonemes).
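
As a sketch of what balancing could look like (the thread says none was applied, so this is an assumption rather than the method used here): weight each utterance inversely to how common its language is, so that every language contributes the same total sampling probability. `language_balanced_weights` is a hypothetical helper name.

```python
import random
from collections import Counter


def language_balanced_weights(languages):
    """Return one sampling weight per utterance.

    `languages` holds the language code of each utterance. Each language
    receives the same total weight, split evenly among its utterances,
    and the weights sum to 1.
    """
    counts = Counter(languages)
    num_languages = len(counts)
    return [1.0 / (num_languages * counts[lang]) for lang in languages]


# Usage: draw utterance indices so that rare languages are not drowned out.
langs = ["en"] * 6 + ["hi"] * 2 + ["pt"] * 1
weights = language_balanced_weights(langs)
picked = random.Random(0).choices(range(len(langs)), weights=weights, k=4)
```

The same weights could be passed to a weighted sampler in whichever training framework is used.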

astricks Jun 15, 2021

Interesting. I’ll try it with Hindi and see how that works. It looks like you trained with all of Mozilla Common Voice, so the encoder has seen some Hindi speakers.

astricks Jun 15, 2021

I was also wondering: would I be able to effectively use the speaker encoder to generate speaker embeddings on my dataset even if I am using a higher sample rate, say, 48 kHz?
