
How to handle non-verbal audio (e.g. laughter, wheezing, crying) in text preprocessing #838

Open
EobardThawne721 opened this issue Jan 6, 2025 · 3 comments

Comments

@EobardThawne721

If there is an audio clip that starts with a laugh and is followed by normal speech, how should I handle that leading laughter in the transcript? For example, if I manually mark it as "[laugh] Ha ha ha, that's funny!", how should the [laugh] marker be handled by a normal G2P pipeline?
I have seen that in the past, VITS and other multilingual models that need to speak both Chinese and English in the same utterance commonly tag the text like this: [ZH] Chinese text [ZH] [EN] hello world [EN]. During G2P the tags act as routing markers: text between [ZH] tags goes through the Chinese front end, and text between [EN] tags goes through the English one. Is it possible to do the same thing for [laugh]?
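A rough sketch of what that tag-based routing could look like. The two G2P functions and the [laugh]/[breath]/[cry] marker names are placeholders I made up for illustration, not anything this repo defines; a real pipeline would plug in its actual Chinese/English front ends (e.g. pypinyin / g2p_en).

```python
import re

# Placeholder G2P front ends: in a real pipeline these would be the Chinese
# and English converters the model already uses.
def chinese_g2p(text: str) -> list[str]:
    return list(text)             # placeholder: one symbol per character

def english_g2p(text: str) -> list[str]:
    return text.lower().split()   # placeholder: word-level "phonemes"

# Hypothetical non-verbal markers; they are kept as single symbols and are
# never sent through G2P, mirroring how [ZH]/[EN] act as routing tags.
SPECIAL_TOKENS = {"[laugh]", "[breath]", "[cry]"}

# Split the input into [ZH]...[ZH] spans, [EN]...[EN] spans, and lone tags.
TAG_PATTERN = re.compile(r"\[ZH\](.*?)\[ZH\]|\[EN\](.*?)\[EN\]|(\[\w+\])")

def text_to_symbols(text: str) -> list[str]:
    symbols = []
    for zh, en, special in TAG_PATTERN.findall(text):
        if zh:
            symbols.extend(chinese_g2p(zh))
        elif en:
            symbols.extend(english_g2p(en))
        elif special in SPECIAL_TOKENS:
            symbols.append(special)   # pass the marker through unchanged
    return symbols

print(text_to_symbols("[laugh][ZH]哈哈，太好笑了[ZH][EN] that's funny [EN]"))
```

With this scheme the non-verbal markers never go through G2P at all; they survive as single symbols that can later be mapped to their own IDs.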

@EobardThawne721
Author

Is there a relatively simple way to handle laughter or wheezing directly with a [laugh]-style marker, similar to how the multilingual TTS models handle their language tags?
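A minimal sketch of that simpler route, assuming the front end maps symbols to integer IDs with a plain lookup table (the phoneme set and marker names below are placeholders, not anything the repo defines): once [laugh] is just another entry in the table, it needs no special handling in the text pipeline, though the model only learns what it means if the training audio actually contains labeled laughter.

```python
# Placeholder phoneme inventory; the real one is whatever symbol set the
# model's front end already defines.
BASE_SYMBOLS = ["_", "AA", "AE", "AH"]

# Hypothetical non-verbal markers appended as ordinary vocabulary entries.
NONVERBAL_SYMBOLS = ["[laugh]", "[breath]", "[cry]"]

SYMBOL_TO_ID = {s: i for i, s in enumerate(BASE_SYMBOLS + NONVERBAL_SYMBOLS)}

def symbols_to_ids(symbols: list[str]) -> list[int]:
    return [SYMBOL_TO_ID[s] for s in symbols]

# "[laugh]" gets an ID like any phoneme: here [laugh] -> 4, AH -> 3.
print(symbols_to_ids(["[laugh]", "AH", "AH"]))   # [4, 3, 3]
```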

@aluminumbox
Collaborator

Our instruct data are all human labeled.

@EobardThawne721
Author

How should I handle it? I haven't dealt with this type of label before. Can you provide a reference for how to handle it? Thank you.
