Does / can whisper-timestamped help with whisper hallucinations? #58

jlinkels · 2023-03-11T22:26:38Z


In Whisper discussions, "hallucinating" describes the phenomenon where previous transcriptions are repeated over and over during periods of no speech.

The audio during those periods might contain silence or background sounds.

Another (minor) issue is that Whisper's first transcription timestamp is always 00:00. For example, when a video segment starts with intro music, the transcription shows the first speech right from the start, even if that speech actually begins at 02:00 or so.

Does whisper-timestamped try to match actual speech timestamps? Or does it just try to match Whisper's transcription with whatever sounds occur at that moment, i.e. the sounds that caused Whisper to issue that transcription in the first place?

Jeronymous (Maintainer) · 2023-03-13T07:55:53Z

whisper-timestamped tries to match actual speech timestamps, by looking at the attention weights Whisper uses on the audio input to make its predictions.

whisper-timestamped can help to reduce hallucinations, by removing words that occur at the end of a chunk and for which the estimated duration is zero.
Additional heuristics could be used to help further.
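
To illustrate the idea of dropping zero-duration words at the end of a chunk, here is a minimal sketch. It is not the code used inside whisper-timestamped; the function name `drop_trailing_zero_duration_words`, the `eps` tolerance, and the word layout (dicts with `text`, `start`, `end` keys) are assumptions made for the example.

```python
def drop_trailing_zero_duration_words(words, eps=1e-3):
    """Remove words at the END of a chunk whose estimated duration is ~0.

    `words` is assumed to be a list of dicts with "text", "start" and "end"
    keys (times in seconds). This layout is an assumption for illustration.
    """
    kept = list(words)
    # Walk backwards from the end of the chunk and stop at the first word
    # that has a real (non-zero) duration.
    while kept and (kept[-1]["end"] - kept[-1]["start"]) <= eps:
        kept.pop()
    return kept


# Hypothetical chunk: two real words followed by two repeated "words"
# that were squeezed onto the chunk boundary with zero duration.
words = [
    {"text": "Thank", "start": 28.4, "end": 28.9},
    {"text": "you", "start": 28.9, "end": 29.3},
    {"text": "you", "start": 30.0, "end": 30.0},
    {"text": "you", "start": 30.0, "end": 30.0},
]
cleaned = drop_trailing_zero_duration_words(words)
print([w["text"] for w in cleaned])
```

Only trailing words are removed; a zero-duration word in the middle of a chunk is left alone, since the heuristic described above targets the end of a chunk.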

The second problem you describe can also be addressed by whisper-timestamped, using the options --vad and --detect_disfluencies.
But I have never seen such a problem with Whisper (even on audio that starts with music), so it deserves to be tested on your use cases.


jlinkels (Author) · 2023-03-13T19:11:09Z

Thank you for your answer. Then it is definitely worth looking into. The --vad option was added very recently, wasn't it?

The start-at-zero-timestamp is also described here: openai/whisper#298. I noticed it myself when I tested transcriptions with the standard WhisperAI in this project: https://github.com/abdeladim-s/subsai.

At first I thought it had something to do with my video material, but then I found this other report. My test is an episode of the TV series Fame (1982). The intro starts with instrumental music only, and it takes some time before the first singer cuts in. Once the singer starts, transcriptions get back to normal. I was surprised, because I found the Whisper results astonishing and assumed this would be easy to avoid, but apparently not.

Like I said, the start timing is a minor problem; it is just nice to have it correct. Hallucinating is a bit more awkward when I am not the only one watching.

Although that project offers whisper-timestamped as a model option, I have not managed to actually use whisper-timestamped yet. It seems there is something wrong in the way whisper-timestamped is called; I am still working on it. If I get any useful information I'll post back.
