[Archived] Additional ideas around dataset and tts output testing
>>> nmstoker
[August 10, 2020, 1:13pm]
There are already some handy tools in the repo for looking at dataset
issues (e.g. here and here).
However for a while I've had an idea in the back of my mind about
looking into the syllables present in the audio and comparing that to
the transcript text to highlight discrepancies, and seeing if it could
be semi-automated to save time.
There are various ways you could do this, including with use of speech
recognition on the audio side, but I identified an approach for the
audio that works tolerably well (it's not perfect).
It's presented in a Gist here: https://gist.github.com/nmstoker/f1590847a16b66ab22c16722aac1cc51
If people think it might be useful added to the repo, I'm happy to do a
PR.
It uses a library called parselmouth, which in turn calls a Praat
script for the audio. For the text syllables there is a handy little
library called syllapy.
I ran it on LJSpeech 1.1 as that's what people often use here, at least
for experimentation. That is a well-produced dataset, but the script
still identified one particular case with a clear problem.
For new / private / self-produced datasets this could be a very useful
way to avoid the need to manually inspect each audio/transcript pair.
At the very least it lets you initially target such efforts.
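To make the targeting idea concrete, here is a minimal sketch of the comparison step only. It is not the code from the Gist: `estimate_text_syllables` is a rough vowel-group heuristic standing in for a proper library such as syllapy, the audio syllable counts are assumed to have been produced separately (e.g. by the Praat script via parselmouth), and the function names and the `tolerance` default are made up for illustration.

```python
import re


def estimate_text_syllables(text):
    """Rough syllable estimate for English text.

    Assumption: a simple vowel-group heuristic, used here only as a
    stand-in for a dedicated library such as syllapy. Digits and
    symbols are ignored, so numerals in a transcript count as zero.
    """
    total = 0
    for word in re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()):
        count = len(re.findall(r"[aeiouy]+", word))
        # Treat a trailing silent 'e' (e.g. "made") as non-syllabic,
        # but keep endings like "-le" ("little") and "-ee" ("tree").
        if count > 1 and word.endswith("e") and not word.endswith(("le", "ee")):
            count -= 1
        total += max(count, 1)
    return total


def rank_discrepancies(rows, tolerance=2):
    """Return clips whose audio and text syllable counts disagree.

    rows: iterable of (clip_id, transcript, audio_syllables), where
    audio_syllables is a count obtained elsewhere (hypothetical input).
    Results are sorted so the largest mismatches come first.
    """
    flagged = []
    for clip_id, transcript, audio_syllables in rows:
        text_syllables = estimate_text_syllables(transcript)
        diff = audio_syllables - text_syllables
        if abs(diff) > tolerance:
            flagged.append((clip_id, text_syllables, audio_syllables, diff))
    return sorted(flagged, key=lambda row: abs(row[3]), reverse=True)
```

Sorting by the size of the mismatch means the clips most likely to have a bad transcript or truncated audio are the first ones you listen to.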
And there is also scope for running it on audio output from TTS to check
that there aren't cases of repeating words (i.e. as often happens when
there are stopnet issues). You could create a large-ish batch of new
transcript sentences to test, fire these at TTS using requests to create
the audio files and then run the comparison between the audio and
transcript to focus on problem cases.
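A rough sketch of that batch workflow might look like the following. The server address and the `/api/tts?text=...` endpoint are assumptions about a locally running TTS server, so adjust them to your setup. It uses the standard library's `urllib` rather than the requests package purely to stay dependency-free, and the `max_ratio` threshold is an arbitrary starting point, not a tuned value.

```python
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen


def build_tts_url(base_url, sentence):
    """Build a GET URL carrying the sentence as a 'text' query parameter."""
    return f"{base_url}?{urlencode({'text': sentence})}"


def plan_batch(sentences, out_dir, base_url="http://localhost:5002/api/tts"):
    """Pair each sentence with its request URL and output wav path.

    Assumption: base_url points at a local TTS server; the default
    here is a guess and may not match your deployment.
    """
    out_dir = Path(out_dir)
    return [
        (build_tts_url(base_url, sentence), out_dir / f"tts_{i:04d}.wav", sentence)
        for i, sentence in enumerate(sentences)
    ]


def fetch_batch(plan, timeout=60):
    """Request each URL in the plan and save the response body as audio."""
    for url, wav_path, _sentence in plan:
        wav_path.parent.mkdir(parents=True, exist_ok=True)
        with urlopen(url, timeout=timeout) as response:
            wav_path.write_bytes(response.read())


def looks_repeated(text_syllables, audio_syllables, max_ratio=1.3):
    """Flag output with far more audio syllables than the text implies,
    which is what repeated words would produce."""
    return text_syllables > 0 and audio_syllables / text_syllables > max_ratio
```

After `fetch_batch` has written the files, the audio syllable counts for each wav can be computed with whatever audio-side method you use and passed to `looks_repeated` alongside the text counts.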
With Praat, there are potentially options to go a bit further than
purely syllables (e.g. to use their 'voice report' (some details)),
so if people have feedback or suggestions before adding this, do fire
away.
[This is an archived TTS discussion thread from discourse.mozilla.org/t/additional-ideas-around-dataset-and-tts-output-testing]