My progress on expressive speech synthesis #89

JRMeyer · 2021-03-07T08:07:25Z

JRMeyer
Mar 7, 2021
Maintainer

>>> geneing
[September 14, 2019, 8:16pm]

I implemented the method of predicting style tokens from text alone, as
described in this paper. The method
works, and the effect, while subtle, is more expressive
speech. Here's an example after fewer than 100K steps: Sound
samples.
Compare, for example, TestSentence_1.wav with TestSentence_GST_1.wav.

Each pair of test sentences was generated by the same Tacotron network.
For the GST wav file, the style tokens were generated by a separate
network that takes the Tacotron encoder output and produces style tokens.
The non-GST file was generated with the style token set to zero.
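
To make the comparison above concrete, here is a minimal NumPy sketch of how a style embedding might be predicted from encoder output alone, next to the all-zero baseline. This is not the code used for the samples. The name `predict_style`, the projection matrix `w_proj`, the mean pooling over time, and all dimensions are assumptions made for illustration; mean pooling only stands in for whatever summarizer the real network uses.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def predict_style(encoder_out, w_proj, style_tokens):
    """Predict a style embedding from text-encoder output alone.

    encoder_out:  (T, enc_dim) encoder states for one utterance
    w_proj:       (enc_dim, n_tokens) hypothetical learned projection
    style_tokens: (n_tokens, style_dim) learned style token embeddings
    """
    summary = encoder_out.mean(axis=0)       # (enc_dim,) pooled over time
    weights = softmax(summary @ w_proj)      # (n_tokens,) combination weights
    return weights @ style_tokens            # (style_dim,) weighted sum of tokens

# Random stand-ins for trained parameters (illustration only).
rng = np.random.default_rng(0)
T, enc_dim, n_tokens, style_dim = 12, 8, 4, 8
encoder_out = rng.normal(size=(T, enc_dim))
w_proj = rng.normal(size=(enc_dim, n_tokens))
style_tokens = rng.normal(size=(n_tokens, style_dim))

gst_style = predict_style(encoder_out, w_proj, style_tokens)  # "GST" sample
no_style = np.zeros(style_dim)                                # baseline sample
```

Both samples in a pair would then go through the same decoder, differing only in which of the two style vectors is supplied.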

Any step-by-step how-to/documentation on synthesizing with a
pre-trained model?

[This is an archived TTS discussion thread from discourse.mozilla.org/t/my-progress-on-expressive-speech-synthesis]

JRMeyer · 2021-03-07T08:07:28Z

>>> geneing
[September 17, 2019, 4:45am]

Authors of the GST papers seem to like treating everything as attention. In
particular, style tokens are used as 'attention' (i.e. the style embedding is
added to the encoder output). This means that the decoder has to disentangle
style and encoded text.

I tried simply concatenating the encoder output and the style net output, and it
works just fine. I can't tell if it's better, but I don't see why one
would ever incorporate style the 'attention' way.
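
The two ways of combining style with the encoder output can be sketched in NumPy as follows. This is only an illustration of the shapes involved, not code from either implementation; the function names `combine_add` and `combine_concat` and the dimensions are hypothetical.

```python
import numpy as np

def combine_add(encoder_out, style):
    """Broadcast-add the style vector to every encoder frame.

    Requires style_dim == enc_dim; text and style share the same channels.
    """
    return encoder_out + style[None, :]

def combine_concat(encoder_out, style):
    """Tile the style vector over time and append it to every encoder frame.

    Text and style stay in separate channels; the feature size grows.
    """
    tiled = np.tile(style, (encoder_out.shape[0], 1))
    return np.concatenate([encoder_out, tiled], axis=-1)

encoder_out = np.ones((5, 8))   # (T, enc_dim) dummy encoder output
style = np.full(8, 0.5)         # (style_dim,) dummy style embedding

added = combine_add(encoder_out, style)
concatenated = combine_concat(encoder_out, style)
print(added.shape, concatenated.shape)  # (5, 8) (5, 16)
```

With concatenation, whatever consumes the encoder output (attention and decoder) has to accept `enc_dim + style_dim` features instead of `enc_dim`, but it no longer has to separate style from text within a single summed vector.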

JRMeyer · 2021-03-07T08:07:31Z

>>> erogol
[September 18, 2019, 2:22pm]

Sounds quite good. Is this your own dataset?

JRMeyer · 2021-03-07T08:07:33Z

>>> geneing
[September 18, 2019, 6:16pm]

I use the mailabs dataset for training (the mary_ann reader). It has
good-quality recordings, and mary_ann is a dynamic reader with a nice voice.

A large number of LibriVox recordings by mary_ann are available.
I'm working on a script to align more of her recordings
for use in training style token prediction from text. However, it's
going slowly because it is difficult to detect all the corner cases
where the forced aligner fails.
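
One generic way to catch some alignment failures is a plausibility check on each aligned segment, for instance flagging segments that are very short or whose characters-per-second rate is far from normal read speech. The sketch below is not the script mentioned above; the function name `suspicious_segments`, the `(start, end, text)` tuple layout, the example segments, and all thresholds are assumptions chosen for illustration.

```python
def suspicious_segments(segments, min_cps=5.0, max_cps=30.0, min_dur=0.3):
    """Return indices of aligned segments that look implausible.

    segments: list of (start_sec, end_sec, text) tuples from a forced aligner.
    The thresholds are illustrative guesses, not tuned values.
    """
    flagged = []
    for i, (start, end, text) in enumerate(segments):
        dur = end - start
        if dur < min_dur:
            # Too short to contain real speech; likely a collapsed alignment.
            flagged.append(i)
            continue
        cps = len(text) / dur  # characters per second of audio
        if cps < min_cps or cps > max_cps:
            # Text and audio length disagree; likely a skipped or merged span.
            flagged.append(i)
    return flagged

# Made-up segments for demonstration.
segments = [
    (0.0, 2.0, "It was a bright cold day."),  # plausible rate
    (2.0, 2.1, "And then"),                   # far too short
    (2.1, 12.1, "Yes."),                      # far too long for the text
]
print(suspicious_segments(segments))  # [1, 2]
```

A check like this only catches gross failures; segments that pass would still need spot checks by ear.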

JRMeyer · 2021-03-07T08:07:36Z

>>> erogol
[September 18, 2019, 7:56pm]

You can also try https://github.com/mozilla/DSAlign
