Use accurate transcription to align #49

RaulKite · 2023-03-07T11:52:04Z

Hi,

I know that there is a method to just align words when I have an accurate transcription of the audio.

I'm even fairly sure I have seen somewhere how to do that with Python, but I can't find it again.

Can someone point me to how to do that?

Thanks

dgoryeo · 2023-03-07T13:04:38Z

These projects (below) are all using a variation of the alignment approach. I think the first one is the one that does exactly what you need:

https://github.com/EtienneAb3d/WhisperTimeSync
https://github.com/m-bain/whisperX/blob/main/whisperx/alignment.py
https://github.com/dmarx/video-killed-the-radio-star

RaulKite (Author) · 2023-03-07T14:09:51Z

> These projects (below) are all using a variation of the alignment approach. I think the first one is the one that does exactly what you need:
>
> https://github.com/EtienneAb3d/WhisperTimeSync
>
> https://github.com/m-bain/whisperX/blob/main/whisperx/alignment.py
>
> https://github.com/dmarx/video-killed-the-radio-star

Thanks. I know those projects, but they have some drawbacks for me.

I'm looking for a way to do that with this project.

Thanks again

Jeronymous (Maintainer) · 2023-03-07T14:56:17Z

Thank you @RaulKite for your loyalty :)

Indeed, in theory it's possible to use the same approach as whisper-timestamped (i.e. Whisper models with their cross-attention weights) to align a given transcription with its audio, even if that transcription was not produced by Whisper.
Such a suggestion came up recently in issue #40.
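
As a rough illustration of the idea (not code from whisper-timestamped), the sketch below runs dynamic time warping over a token-by-frame attention matrix and reads off a time span per token. The matrix here is synthetic, the function names `dtw_path` and `token_times` are hypothetical, and the 0.02 s frame duration is an assumption based on Whisper's encoder producing 1500 frames per 30-second window. Extracting real cross-attention weights for a supplied transcription is the part that would need the code reorganization discussed in this thread.

```python
import numpy as np

# Assumed frame duration: 30 s of audio / 1500 encoder frames.
SECONDS_PER_FRAME = 0.02


def dtw_path(cost):
    """Return the cheapest monotonic path through a (tokens x frames) cost matrix."""
    n_tokens, n_frames = cost.shape
    acc = np.full((n_tokens + 1, n_frames + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n_tokens + 1):
        for j in range(1, n_frames + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev

    # Backtrack from the bottom-right corner to the top-left corner.
    i, j = n_tokens, n_frames
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def token_times(attention, seconds_per_frame=SECONDS_PER_FRAME):
    """Map each token (row) to a (start, end) span in seconds.

    `attention` is assumed non-negative with a positive maximum.
    """
    cost = 1.0 - attention / attention.max()  # high attention -> low cost
    spans = {}
    for token, frame in dtw_path(cost):
        first, last = spans.get(token, (frame, frame))
        spans[token] = (min(first, frame), max(last, frame))
    return [
        (round(spans[t][0] * seconds_per_frame, 3),
         round((spans[t][1] + 1) * seconds_per_frame, 3))
        for t in range(attention.shape[0])
    ]


if __name__ == "__main__":
    # Synthetic example: 3 tokens, 6 frames, each token attends to 2 frames.
    attention = np.zeros((3, 6))
    attention[0, 0:2] = 1.0
    attention[1, 2:4] = 1.0
    attention[2, 4:6] = 1.0
    print(token_times(attention))
```

With real weights the matrix is noisy, so a practical version would likely smooth or normalize it before running the DTW.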

It requires reorganizing the code a bit, which is not a big deal.
But it also raises the question of the format of the "accurate" transcription.
Currently, whisper-timestamped relies on the Whisper transcription, which includes segments (of at most 30 seconds) and hints about where these segments start and end in the audio.
Making this work when a transcription is given for a one-hour audio file, without any clue about where each part falls, requires more thought and work.

So @RaulKite, what transcription format would you provide for an audio file?

EDIT: The Whisper transcription also includes the (detected?) language. Is that something you would like to provide, or should it be detected automatically?
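
To make the format question concrete, here is one hypothetical shape such an input could take, mirroring what the comment above says a Whisper transcription contains: segments with approximate start and end hints, plus an optional language. None of these names (`Segment`, `Transcription`, `oversized_segments`) exist in whisper-timestamped; they are assumptions for illustration only. The helper flags segments whose hinted span would not fit in a single 30-second window, which is the case described above as needing more work.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_WINDOW_SEC = 30.0  # maximum segment length mentioned in the thread


@dataclass
class Segment:
    text: str
    start: float  # approximate start hint, in seconds
    end: float    # approximate end hint, in seconds


@dataclass
class Transcription:
    segments: List[Segment]
    language: Optional[str] = None  # None would mean "detect automatically"


def oversized_segments(transcription: Transcription) -> List[int]:
    """Return indices of segments whose hinted span exceeds one 30-second window."""
    return [
        index
        for index, segment in enumerate(transcription.segments)
        if segment.end - segment.start > MAX_WINDOW_SEC
    ]


if __name__ == "__main__":
    example = Transcription(
        segments=[
            Segment("Hello and welcome.", 0.0, 12.5),
            Segment("A long passage without finer hints.", 12.5, 50.0),
        ],
        language="en",
    )
    print(oversized_segments(example))
```

A transcription given as a single block of text for a one-hour recording would be one oversized segment in this representation.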

2 replies

dgoryeo Mar 7, 2023

To add to that, the alignment approach makes the solution more language-dependent and less universal, for example for Asian languages.

RaulKite Mar 8, 2023
Author

> Thank you @RaulKite for your loyalty :)
>
> Indeed, in theory it's possible to use the same approach as whisper-timestamped (i.e. Whisper models with their cross-attention weights) to align a given transcription with its audio, even if that transcription was not produced by Whisper. Such a suggestion came up recently in issue #40.
>
> It requires reorganizing the code a bit, which is not a big deal. But it also raises the question of the format of the "accurate" transcription. Currently, whisper-timestamped relies on the Whisper transcription, which includes segments (of at most 30 seconds) and hints about where these segments start and end in the audio. Making this work when a transcription is given for a one-hour audio file, without any clue about where each part falls, requires more thought and work.
>
> So @RaulKite, what transcription format would you provide for an audio file?
>
> EDIT: The Whisper transcription also includes the (detected?) language. Is that something you would like to provide, or should it be detected automatically?

My idea was to use WhisperHallu ( https://github.com/EtienneAb3d/WhisperHallu ) to minimize hallucinations and then whisper-timestamped to align word timestamps.

Regards

Raúl
