Word segmentation with mACPC? #1
Hi Jerome! Let me try to address your questions:
Let me know if I can be of further help :)
Hi, thanks for the pointer. I later found that when obtaining the raw frame boundaries, I had forgotten to account for the fact that each discovered segment x in the second level corresponds to a frame length of y. Accounting for that gives segmentation results that at least spread across the 128 frames of a 20480-sample raw audio chunk. Unfortunately, I find that the word boundary precision and recall scores are only around 0.2 on both the forced-aligned LibriSpeech dev set (https://github.com/CorentinJ/librispeech-alignments) and the Buckeye test set, with an mACPC model trained for a total of 50 epochs on ls100.

The boundary score was calculated somewhat differently: I read each audio file individually (instead of using the data loader, which reads in all the audio files and then cuts them into chunks), split it into a series of length-20480 chunks, and fed each chunk into the model to obtain the boundaries. However, for each chunk I only took the boundaries returned by the model and did not prepend 0 or append 128 to them. In other words, I only kept the "internal" boundaries of each chunk, concatenated them, and evaluated them against the ground-truth boundaries (also without manually adding start/end or between-chunk boundaries).

I tried the following ways to obtain the second-level (between-segment) boundaries:
I usually see (2) perform better, but it still cannot achieve precision and recall scores anywhere above 0.25 for the internal boundaries. Edit: I have modified the class MultiLevelModel(nn.Module) in models.py. Basically, I uncommented
and returned this variable so that I know how many frames each mean-reduced segment corresponds to. A rough sketch of the whole chunking-and-evaluation procedure I described is below.
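To make that concrete, here is a hypothetical sketch of what I mean. The names (model_predict, frames_per_segment) and the 2-frame matching tolerance are my own placeholders, not the repo's actual API; model_predict just stands in for a forward pass through the trained mACPC model.

```python
import numpy as np

CHUNK_LEN = 20480        # raw samples per chunk, as described above
FRAMES_PER_CHUNK = 128   # encoder frames per 20480-sample chunk


def segment_boundaries_to_frames(seg_boundaries, frames_per_segment):
    """Convert second-level boundaries, given as indices into the sequence of
    mean-reduced segments, into encoder-frame indices.

    frames_per_segment is the per-segment frame count that the modified
    MultiLevelModel is assumed to return (the name is mine, not the repo's).
    """
    seg_ends = np.cumsum(frames_per_segment)   # frame index at each segment end
    return seg_ends[np.asarray(seg_boundaries, dtype=int)]


def boundary_precision_recall(pred, ref, tol_frames=2):
    """Greedy one-to-one matching of predicted vs. reference frame boundaries
    within a tolerance (the 2-frame tolerance is my own choice)."""
    pred, ref = sorted(pred), sorted(ref)
    matched, hits = set(), 0
    for p in pred:
        for j, r in enumerate(ref):
            if j not in matched and abs(p - r) <= tol_frames:
                matched.add(j)
                hits += 1
                break
    precision = hits / max(len(pred), 1)
    recall = hits / max(len(ref), 1)
    return precision, recall


def evaluate_utterance(waveform, ref_frame_boundaries, model_predict):
    """Split one utterance into 20480-sample chunks, keep only the 'internal'
    boundaries of each chunk (no prepended 0 / appended 128), and score them
    against the reference boundaries."""
    all_pred = []
    for i in range(len(waveform) // CHUNK_LEN):
        chunk = waveform[i * CHUNK_LEN:(i + 1) * CHUNK_LEN]
        # placeholder for the actual model call
        frames_per_segment, seg_boundaries = model_predict(chunk)
        frame_bounds = segment_boundaries_to_frames(seg_boundaries, frames_per_segment)
        internal = [int(b) for b in frame_bounds if 0 < b < FRAMES_PER_CHUNK]
        all_pred.extend(b + i * FRAMES_PER_CHUNK for b in internal)
    return boundary_precision_recall(all_pred, ref_frame_boundaries)
```

This is just how I am computing the scores on my side; if the intended evaluation adds chunk-edge boundaries or uses a different tolerance, that could explain part of the gap.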
We didn't test it on the LibriSpeech dev set, but the results on Buckeye should agree. Some things come to mind that might explain the discrepancy:
I haven't looked at your code yet, but the procedure you describe sounds solid. I'll take a look as soon as I have some time.
Hello, I am trying to use this repo to replicate the word segmentation results here (https://arxiv.org/pdf/2110.15909.pdf). I have some questions about this:
Thanks!