option --suppress_token
to reduce hallucinations / output special noise descriptions
#107
Replies: 8 comments 3 replies
-
First, here is a small piece of code to see what this token means:

```python
import whisper

tokenizer = whisper.tokenizer.get_tokenizer(True, task="transcribe", language="en")
tokenizer.decode_with_timestamps([50364])
# Out[3]: '<|0.00|>'
```

So this token is the first timestamp token: it marks time "0.00" and tells the Whisper decoder to predict a timestamp at the end of each segment. Now I don't understand what you mean by "suppress this token": how do you do that? There is a mode in which the Whisper model can predict the transcription without predicting timestamps.
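As background for the rest of the thread: Whisper's timestamp tokens form a contiguous range starting at `<|0.00|>` and advancing in 0.02-second steps, so the token id alone tells you which time it encodes. A minimal sketch of that mapping (the constant and function names below are mine, not whisper's):

```python
# First timestamp token for multilingual models; English-only models use
# 50363 (as noted later in this thread). Names here are illustrative.
TIMESTAMP_BEGIN = 50364

def timestamp_token_to_seconds(token_id: int) -> float:
    """Map a timestamp token id to the time it encodes, in 0.02 s steps."""
    return (token_id - TIMESTAMP_BEGIN) * 0.02

print(timestamp_token_to_seconds(50364))  # 0.0  -> decodes as <|0.00|>
print(timestamp_token_to_seconds(50464))  # 2.0  -> decodes as <|2.00|>
```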
-
Thanks! OK, that's interesting. Actually, the only thing I did was to add the parameter `--suppress_tokens 50364`. Here's the whole invocation (for a single sample):
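The invocation itself did not survive in this copy of the thread; a hypothetical reconstruction, assuming the standard openai-whisper CLI and an illustrative file name, might look like:

```shell
# Hypothetical reconstruction -- the original command was truncated.
whisper sample.wav --model medium --language en --suppress_tokens 50364
```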
-
Thanks a lot @misutoneko for the clarification. Okay, so it's getting really interesting.
It seems that the first point has no influence on hallucinations during silences, but the second one does. This also gave me an idea of making Whisper decode in the "
-
Very nice, I knew you'd make sense of it :D
-
Yes, exactly. On my side, I played a bit with
But I'm happy for you if you have a great experience with this.
-
Yes, one step closer to perfection. It will take a while to get there, though... Hmm, any interesting (= problematic) samples that you could share?
EDIT: I just did some more testing and noticed that some timing issues were also fixed by using this option!
-
I really like this approach, @misutoneko. I've been trying to deal with hallucinations for a while now, but suppressing tokens had never crossed my mind in the meantime.

I've been experimenting with the approach you mentioned in #1488; the Python code for it is `suppress_tokens = []`. Basically, what this does is leave all the tokens from `config.json` unsuppressed while you suppress only 50364. The problem is that if you set `suppress_tokens` to an empty list, you are no longer suppressing the special characters that `config.json` covers; some of my transcriptions are returning the "- " symbol before the text, so I suppose it can get unpredictable.

What I'm going to do now is check all the tokens in `config.json`, see which ones are the background-noise descriptions, and stop suppressing those, so the hallucinations that repeat previous tokens during silence may not happen; at the same time, I will keep suppressing the special characters I don't want in my text, like the "- " mentioned above. I am currently using faster-whisper.

Thank you for your work.
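The filtering described above can be sketched in plain Python. Everything except 50364 below is a placeholder of my own, not taken from a real `config.json`:

```python
# Sketch of the plan above: start from the default suppress list found in
# the model's config.json, un-suppress the tokens you want to keep (e.g.
# noise descriptions), and additionally suppress the first timestamp token.
DEFAULT_SUPPRESS = [1, 2, 7, 8, 9, 10, 14, 25]  # placeholder ids, not the real list
KEEP = {7, 8}                                   # placeholder: tokens to stop suppressing
FIRST_TIMESTAMP_TOKEN = 50364                   # 50363 for English-only models

def build_suppress_tokens(default=DEFAULT_SUPPRESS, keep=KEEP,
                          extra=(FIRST_TIMESTAMP_TOKEN,)):
    """Return a sorted suppress list: defaults minus `keep`, plus `extra`."""
    return sorted((set(default) - set(keep)) | set(extra))

print(build_suppress_tokens())  # [1, 2, 9, 10, 14, 25, 50364]
```

The resulting list can then be passed wherever the library accepts `suppress_tokens`.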
-
@misutoneko may I ask you how I can identify the token I want to suppress for languages other than English, e.g. Swedish?
-
Hi again,
I've now (finally) taken a peek at .words.json files, and it immediately paid off :D
I noticed that with the medium model (with `--language en`), the first token is always 50364.
It's some kind of special token, I guess, but I couldn't find any direct references, nor do I have any idea where it comes from.
Long story short, if I suppress this token, that will ~~totally~~ mostly eradicate any hallucinations related to non-speech clips. The clip will get a reasonable description of any noise or music instead => yay :D
So, is there a reason for this token to exist? Perhaps it should be suppressed by default.
I haven't noticed any downsides to suppressing it, but I guess it's possible that some utterances might go undetected if they genuinely contain this token.
EDIT:
It seems this token is the same in all the multilingual models.
For English-only models the token is 50363 (I didn't test the large ones, though; they're probably the same).
EDIT2:
Looks like this might depend on which language is used. These results apply to English audio, as noted.