
[FIX INCLUDED] BatchedInferencePipeline.transcribe does not seem to use append_punctuations and prepend_punctuations properly #1252

Open
miroirdelame opened this issue Feb 20, 2025 · 0 comments

Hello,

When using word timestamps, words containing an apostrophe end up split in two, like this (I'm French, so all my examples are in French):

  • Bonjour
  • j'
  • ai
  • perdu
  • l'
  • avion

Here is a Python script that reproduces this behaviour: https://pastebin.com/qzLSnfnk
I can send you a sample MP3 file if needed.

This can easily be fixed by patching two lines in utils/transcribe.py:

Line 1874 in merge_punctuations
if previous["word"].startswith(" ") and previous["word"].strip() in prepended:
Should become
if previous["word"].startswith(" ") and any(previous["word"].strip().endswith(p) for p in prepended):

Line 1890 in merge_punctuations
if not previous["word"].endswith(" ") and following["word"] in appended:
Should become
if not previous["word"].endswith(" ") and any(following["word"].startswith(p) for p in appended):
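To make the effect of the two changes easier to check without an audio file, here is a minimal, self-contained sketch. It is not the library's code: `merge_punctuations_sketch` is a hypothetical stand-in that only tracks the "word" text, the leading spaces on the input words are an assumption about how the tokens arrive, and the two punctuation strings are what I believe the default values of prepend_punctuations and append_punctuations to be.

```python
# Assumed defaults for prepend_punctuations / append_punctuations.
PREPENDED = "\"'“¿([{-"
APPENDED = "\"'.。,，!！?？:：”)]}、"


def merge_punctuations_sketch(words, prepended, appended, patched=True):
    """Hypothetical, text-only imitation of merge_punctuations.

    patched=False uses the current conditions, patched=True the proposed ones.
    """
    alignment = [{"word": w} for w in words]

    def is_prepended(word):
        stripped = word.strip()
        if patched:
            return any(stripped.endswith(p) for p in prepended)
        return stripped in prepended

    def is_appended(word):
        if patched:
            return any(word.startswith(p) for p in appended)
        return word in appended

    # Pass 1, right to left: glue a prefix-like fragment onto the next word.
    i, j = len(alignment) - 2, len(alignment) - 1
    while i >= 0:
        previous, following = alignment[i], alignment[j]
        if previous["word"].startswith(" ") and is_prepended(previous["word"]):
            following["word"] = previous["word"] + following["word"]
            previous["word"] = ""
        else:
            j = i
        i -= 1

    # Pass 2, left to right: glue a suffix-like fragment onto the previous word.
    i, j = 0, 1
    while j < len(alignment):
        previous, following = alignment[i], alignment[j]
        if not previous["word"].endswith(" ") and is_appended(following["word"]):
            previous["word"] = previous["word"] + following["word"]
            following["word"] = ""
        else:
            i = j
        j += 1

    # Drop the entries that were emptied by a merge.
    return [w["word"] for w in alignment if w["word"]]


# Assumed tokenisation of "Bonjour j'ai perdu l'avion".
words = [" Bonjour", " j'", "ai", " perdu", " l'", "avion"]

print(merge_punctuations_sketch(words, PREPENDED, APPENDED, patched=False))
# [' Bonjour', " j'", 'ai', ' perdu', " l'", 'avion']
print(merge_punctuations_sketch(words, PREPENDED, APPENDED, patched=True))
# [' Bonjour', " j'ai", ' perdu', " l'avion"]
```

With the current condition, "j'" is compared as a whole against the punctuation string, so it never matches. With the patched condition, only its last character has to be one of the prepended punctuations, so it is merged into the following word.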

I hope it won't break anything. XD

Thanks in advance for your review.
