[FIX INCLUDED] BatchedInferencePipeline.transcribe does not seem to use append_punctuations and prepend_punctuations properly #1252

miroirdelame · 2025-02-20T21:05:19Z

Hello,

When using word timestamps, words containing an apostrophe end up split in two, like this (I'm French, so all my examples are in French):

Bonjour
j'
ai
perdu
l'
avion

Here is a Python script that reproduces this behaviour: https://pastebin.com/qzLSnfnk
I can send you a sample MP3 file if needed.

This can easily be fixed by patching two lines in utils/transcribe.py:

Line 1874 in merge_punctuations
if previous["word"].startswith(" ") and previous["word"].strip() in prepended:
Should become
if previous["word"].startswith(" ") and any(previous["word"].strip().endswith(p) for p in prepended):

Line 1890 in merge_punctuations
if not previous["word"].endswith(" ") and following["word"] in appended:
Should become
if not previous["word"].endswith(" ") and any(following["word"].startswith(p) for p in appended):
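To illustrate why the first condition never matches on the example above, here is a minimal standalone sketch, not the library's code. It assumes that words arrive with a leading space (" j'", then "ai") and that the default prepend_punctuations string is the one shown below. The helper names and the simplified forward merge loop are hypothetical and only stand in for merge_punctuations.

```python
# Assumed default value of prepend_punctuations (an assumption, check your version).
PREPENDED = "\"'“¿([{-"


def matches_current(word, prepended):
    # Current check: the whole stripped word must be found inside the string.
    return word.startswith(" ") and word.strip() in prepended


def matches_patched(word, prepended):
    # Proposed check: the stripped word only has to end with one of the characters.
    return word.startswith(" ") and any(word.strip().endswith(p) for p in prepended)


def merge_prepended(words, matches, prepended=PREPENDED):
    # Simplified stand-in for the first half of merge_punctuations:
    # a matching word is glued onto the front of the word that follows it.
    merged, carry = [], ""
    for word in words:
        if matches(word, prepended):
            carry += word
        else:
            merged.append(carry + word)
            carry = ""
    if carry:
        merged.append(carry)
    return merged


words = [" Bonjour", " j'", "ai", " perdu", " l'", "avion"]
before = merge_prepended(words, matches_current)
after = merge_prepended(words, matches_patched)
print(before)
print(after)
```

With the current check, "j'" is compared as a whole against the punctuation string, so nothing is merged. With the patched check, " j'" and "ai" become " j'ai". Note that under the same assumption the patched check would also match any word ending with another listed character, such as a hyphen.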

I hope it won't break anything. XD

Thanks in advance for your review.
