Replies: 2 comments
-
whisper-timestamped tries to match actual speech timestamps by looking at the attention weights Whisper applies to the audio input when it makes its predictions. It can also help reduce hallucinations by removing words that occur at the end of a chunk and whose estimated duration is zero. The second problem you describe (the first timestamp always being 00:00) can also be addressed by whisper-timestamped.
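To illustrate the idea of dropping zero-duration words at the end of a chunk, here is a minimal sketch. It is not the code used inside whisper-timestamped; the function name `drop_trailing_zero_duration_words`, the `eps` tolerance, and the word layout (dicts with `text`, `start`, and `end` in seconds) are assumptions made for this example.

```python
def drop_trailing_zero_duration_words(words, eps=1e-6):
    """Return a copy of `words` without zero-duration words at the end.

    `words` is assumed to be a list of dicts with "text", "start" and "end"
    keys (times in seconds). This layout is an assumption for illustration.
    """
    kept = list(words)
    # Pop words from the end for as long as their duration is (almost) zero.
    while kept and (kept[-1]["end"] - kept[-1]["start"]) <= eps:
        kept.pop()
    return kept


# Hypothetical chunk: the last two words have no duration, which is typical
# of text that was repeated without matching audio.
chunk = [
    {"text": "Thank", "start": 12.10, "end": 12.42},
    {"text": "you", "start": 12.42, "end": 12.70},
    {"text": "Thank", "start": 12.70, "end": 12.70},
    {"text": "you", "start": 12.70, "end": 12.70},
]

cleaned = drop_trailing_zero_duration_words(chunk)
print([w["text"] for w in cleaned])  # ['Thank', 'you']
```

Only trailing words are removed here, so a zero-duration word in the middle of a chunk is left alone.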
-
Thank you for your answer. Then it is definitely worth looking into. The --vad option has been added very recently, hasn't it?

The start-at-zero timestamp issue is also described here: openai/whisper#298. I noticed it myself when I tested transcriptions with the standard WhisperAI in this project: https://github.com/abdeladim-s/subsai. At first I thought it had something to do with my video material, but then I found this other report. My test is an episode of the TV series Fame (1982). The intro starts with instrumental music only, and it takes some time before the first singer cuts in. Once the singer starts, the transcriptions get back to normal. I was surprised, because I found the Whisper results astonishing and assumed this would be easy to avoid, but apparently not.

Like I said, the start timing is a minor problem; it is just nice to have it correct. Hallucinating is a bit more awkward when I am not the only one watching. Although that project offers whisper-timestamped as a model option, I have not managed to actually use whisper-timestamped yet. It seems there is something wrong in the way whisper-timestamped is called; I am still working on it. If I get any useful information I'll post back.
-
In Whisper discussions, "hallucinating" is described as the phenomenon where, during periods of no speech, previous transcriptions are repeated over and over.
The background might contain silence or background sounds.
Another (minor) issue is that Whisper's first transcription timestamp is always at 00:00. For example, when a video segment starts with intro music, the transcription shows the first speech at 00:00, even if it actually starts at 02:00 or so.
Does whisper-timestamped try to match actual speech timestamps? Or does it just try to match Whisper's transcription with the sounds occurring at that moment, that is, the sounds that actually caused Whisper to issue that transcription?