Detect language and transcribe in separate steps #1245

Open

pablopla opened this issue Feb 12, 2025 · 3 comments

@pablopla
Is it possible to detect the language in an audio file and transcribe it in separate steps?
I have a fine-tuned model for a specific language. I'm trying to detect the language, then use the fine-tuned model if the language matches, or the general model otherwise.
openai/whisper has an example of how to do this in its README, but I couldn't find an equivalent in faster-whisper:

mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
@heimoshuiyu
Contributor

Yes, you can:

def detect_language(
    self,
    audio: Optional[np.ndarray] = None,
    features: Optional[np.ndarray] = None,
    vad_filter: bool = False,
    vad_parameters: Union[dict, VadOptions] = None,
    language_detection_segments: int = 1,
    language_detection_threshold: float = 0.5,
) -> Tuple[str, float, List[Tuple[str, float]]]:
    """
    Use Whisper to detect the language of the input audio or features.

    Arguments:
        audio: Input audio signal, must be a 1D float array sampled at 16 kHz.
        features: Input Mel spectrogram features, must be a float array with
            shape (n_mels, n_frames). If `audio` is provided, the features will be ignored.
            Either `audio` or `features` must be provided.
        vad_filter: Enable the voice activity detection (VAD) to filter out parts of the audio
            without speech. This step uses the Silero VAD model.
        vad_parameters: Dictionary of Silero VAD parameters or VadOptions class (see available
            parameters and default values in the class `VadOptions`).
        language_detection_threshold: If the maximum probability of the language tokens is
            higher than this value, the language is detected.
        language_detection_segments: Number of segments to consider for the language detection.

    Returns:
        language: Detected language.
        language_probability: Probability of the detected language.
        all_language_probs: List of tuples with all language names and probabilities.
    """
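For illustration, a minimal sketch of the call itself, assuming you already have the audio as a 1-D float32 NumPy array sampled at 16 kHz (how to produce that array is discussed further below):

import numpy as np

# Hypothetical input: one second of silence at 16 kHz, only to show the call shape;
# use real decoded audio in practice.
audio = np.zeros(16000, dtype=np.float32)

language, language_probability, all_language_probs = model.detect_language(audio)
print(f"Detected language: {language} ({language_probability:.2f})")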

@pablopla
Author

@heimoshuiyu detect_language expects the audio or features argument as an np.ndarray. There is no obvious method to get the audio or features in the required format.

@namtacs commented Feb 17, 2025

There is a straightforward way to decode audio: just look at the top of the transcribe method:

sampling_rate = self.feature_extractor.sampling_rate

if multilingual and not self.model.is_multilingual:
    self.logger.warning(
        "The current model is English-only but the multilingual parameter is set to"
        " True; setting to False instead."
    )
    multilingual = False

if not isinstance(audio, np.ndarray):
    audio = decode_audio(audio, sampling_rate=sampling_rate)

All you need to do is:

from faster_whisper.audio import decode_audio

audiofile = "audio.mp3"  # can also be a binary file object (e.g. BytesIO)
audio = decode_audio(audiofile, sampling_rate=model.feature_extractor.sampling_rate)
language, language_probability, all_language_probs = model.detect_language(audio)
print(f"Detected language: {language}")

Note: by default this considers only the first 30-second segment of the audio. Pass language_detection_segments=SEGMENTS_NUM to model.detect_language to specify how many segments to consider.
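Putting the thread together for the original use case, a hedged sketch: detect the language once, then transcribe with either the fine-tuned or the general model. The model names, paths, and the target language code here are placeholders, not from this thread:

from faster_whisper import WhisperModel
from faster_whisper.audio import decode_audio

general_model = WhisperModel("large-v3")                    # placeholder model size
finetuned_model = WhisperModel("/path/to/finetuned-model")  # hypothetical local path
TARGET_LANGUAGE = "de"                                      # hypothetical fine-tuned language

# Decode once at the sampling rate the feature extractor expects (16 kHz).
audio = decode_audio("audio.mp3", sampling_rate=general_model.feature_extractor.sampling_rate)

# Step 1: detect the language with the general model.
language, probability, _ = general_model.detect_language(audio)

# Step 2: transcribe with the fine-tuned model if the language matches, else the general one.
model = finetuned_model if language == TARGET_LANGUAGE else general_model
segments, info = model.transcribe(audio, language=language)
for segment in segments:
    print(segment.text)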
