option --suppress_token
to reduce hallucinations / output special noise descriptions
#107
Replies: 8 comments 3 replies
-
First, here is a small piece of code to see what this token means:

```python
import whisper

tokenizer = whisper.tokenizer.get_tokenizer(True, task="transcribe", language="en")
tokenizer.decode_with_timestamps([50364])
# Out[3]: '<|0.00|>'
```

So this token is the first timestamp token: it marks time "0.00" and tells the Whisper decoder to predict a timestamp at the end of each segment. Now I don't understand what you mean by "suppress this token": how do you do that? There is a mode in which the Whisper model can predict the transcription without predicting timestamps.
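As background for the rest of the thread: Whisper's timestamp tokens form a contiguous range starting at `<|0.00|>` and advancing in 0.02-second steps, so the token id alone tells you which time it encodes. A minimal sketch of that mapping (the constant and function names below are mine, not whisper's):

```python
# First timestamp token for multilingual models; English-only models use
# 50363 (as noted later in this thread). Names here are illustrative.
TIMESTAMP_BEGIN = 50364

def timestamp_token_to_seconds(token_id: int) -> float:
    """Map a timestamp token id to the time it encodes, in 0.02 s steps."""
    return (token_id - TIMESTAMP_BEGIN) * 0.02

print(timestamp_token_to_seconds(50364))  # 0.0  -> decodes as <|0.00|>
print(timestamp_token_to_seconds(50464))  # 2.0  -> decodes as <|2.00|>
```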
-
Thanks! OK, that's interesting. Actually, the only thing I did was to add the parameter `--suppress_tokens 50364`. Here's the whole invocation (for a single sample):
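The invocation itself did not survive in this copy of the thread; a hypothetical reconstruction, assuming the standard openai-whisper CLI and an illustrative file name, might look like:

```shell
# Hypothetical reconstruction -- the original command was truncated.
whisper sample.wav --model medium --language en --suppress_tokens 50364
```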
-
Thanks a lot @misutoneko for the clarification. Okay, so it's getting really interesting.
It seems that the first point has no influence on hallucinations during silences, but the second one does. This also gave me an idea of making Whisper decode in the "
-
Very nice, I knew you'd make sense of it :D
-
Yes, exactly. On my side, I played a bit with
But I'm happy for you if you have a great experience with this.
-
Yes, one step closer to perfection. It will take a while to get there, though... Hmm, any interesting (= problematic) samples that you could share?
EDIT: I just did some more testing and noticed that some timing issues were also fixed by using this option!
-
I really like this approach, @misutoneko. I've been trying to deal with hallucinations for a while now, but suppressing tokens had never crossed my mind in the meantime.

I've been experimenting with the approach you mentioned in #1488; the Python code for it is `suppress_tokens = []`. Basically, what this does is leave all the tokens from `config.json` unsuppressed while you suppress only 50364. The problem is that if you set `suppress_tokens` to an empty list, you are no longer suppressing the special characters that `config.json` covers; some of my transcriptions are returning the "- " symbol before the text, so I suppose it can get unpredictable.

What I'm going to do now is check all the tokens in `config.json`, see which ones are the background-noise descriptions, and stop suppressing those, so the hallucinations that repeat previous tokens during silence may not happen; at the same time, I will keep suppressing the special characters I don't want in my text, like the "- " mentioned above. I am currently using faster-whisper.

Thank you for your work.
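The filtering described above can be sketched in plain Python. Everything except 50364 below is a placeholder of my own, not taken from a real `config.json`:

```python
# Sketch of the plan above: start from the default suppress list found in
# the model's config.json, un-suppress the tokens you want to keep (e.g.
# noise descriptions), and additionally suppress the first timestamp token.
DEFAULT_SUPPRESS = [1, 2, 7, 8, 9, 10, 14, 25]  # placeholder ids, not the real list
KEEP = {7, 8}                                   # placeholder: tokens to stop suppressing
FIRST_TIMESTAMP_TOKEN = 50364                   # 50363 for English-only models

def build_suppress_tokens(default=DEFAULT_SUPPRESS, keep=KEEP,
                          extra=(FIRST_TIMESTAMP_TOKEN,)):
    """Return a sorted suppress list: defaults minus `keep`, plus `extra`."""
    return sorted((set(default) - set(keep)) | set(extra))

print(build_suppress_tokens())  # [1, 2, 9, 10, 14, 25, 50364]
```

The resulting list can then be passed wherever the library accepts `suppress_tokens`.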
-
@misutoneko may I ask you how I can identify the token I want to suppress for languages other than English, e.g. Swedish?
-
Hi again,
I've now (finally) taken a peek at .words.json files, and it immediately paid off :D
I noticed that with the medium model (with `--language en`), the first token is always 50364.
It's some kind of special token, I guess, but I couldn't find any direct references, nor do I have any idea where it comes from.
Long story short, if I suppress this token, that will ~~totally~~ mostly eradicate any hallucinations related to non-speech clips. The clip will get a reasonable description of any noise or music instead => yay :D
So, is there a reason for this token to exist? Perhaps it should be suppressed by default.
I haven't noticed any downsides to suppressing it, but I guess it's possible that some utterances might go undetected if they genuinely contain this token.
EDIT:
It seems this token is the same in all the multilingual models.
For English-only models the token is 50363 (I didn't test the large ones, though; they're probably the same).
EDIT2:
Looks like this might depend on which language is used. These results apply to English audio, as noted.