Detect language and transcribe in separate steps #1245

Open

pablopla opened this issue Feb 12, 2025 · 3 comments

@pablopla
Is it possible to detect the language in an audio file and transcribe it in separate steps?
I have a fine-tuned model for a specific language. I'm trying to detect the language, then use the fine-tuned model if the language matches, or the general model otherwise.
openai/whisper has an example of how to do this in its README, but I couldn't find an equivalent in faster-whisper:

mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
@heimoshuiyu
Contributor

Yes, you can:

def detect_language(
    self,
    audio: Optional[np.ndarray] = None,
    features: Optional[np.ndarray] = None,
    vad_filter: bool = False,
    vad_parameters: Union[dict, VadOptions] = None,
    language_detection_segments: int = 1,
    language_detection_threshold: float = 0.5,
) -> Tuple[str, float, List[Tuple[str, float]]]:
    """
    Use Whisper to detect the language of the input audio or features.

    Arguments:
        audio: Input audio signal, must be a 1D float array sampled at 16 kHz.
        features: Input Mel spectrogram features, must be a float array with
            shape (n_mels, n_frames). If `audio` is provided, the features will be ignored.
            Either `audio` or `features` must be provided.
        vad_filter: Enable the voice activity detection (VAD) to filter out parts of the audio
            without speech. This step uses the Silero VAD model.
        vad_parameters: Dictionary of Silero VAD parameters or VadOptions class (see available
            parameters and default values in the class `VadOptions`).
        language_detection_threshold: If the maximum probability of the language tokens is
            higher than this value, the language is detected.
        language_detection_segments: Number of segments to consider for the language detection.

    Returns:
        language: Detected language.
        language_probability: Probability of the detected language.
        all_language_probs: List of tuples with all language names and probabilities.
    """
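For illustration, a minimal sketch of the call itself, assuming you already have the audio as a 1-D float32 NumPy array sampled at 16 kHz (how to produce that array is discussed further below):

import numpy as np

# Hypothetical input: one second of silence at 16 kHz, only to show the call shape;
# use real decoded audio in practice.
audio = np.zeros(16000, dtype=np.float32)

language, language_probability, all_language_probs = model.detect_language(audio)
print(f"Detected language: {language} ({language_probability:.2f})")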

@pablopla
Author

@heimoshuiyu detect_language expects the audio or features argument as an np.ndarray. There is no obvious method to get the audio or features in the required format.

@namtacs commented Feb 17, 2025

There is a straightforward way to decode audio: just look at the top of the transcribe method:

sampling_rate = self.feature_extractor.sampling_rate

if multilingual and not self.model.is_multilingual:
    self.logger.warning(
        "The current model is English-only but the multilingual parameter is set to"
        " True; setting to False instead."
    )
    multilingual = False

if not isinstance(audio, np.ndarray):
    audio = decode_audio(audio, sampling_rate=sampling_rate)

All you need to do is:

from faster_whisper.audio import decode_audio

audiofile = "audio.mp3"  # can also be a binary file object (e.g. BytesIO)
audio = decode_audio(audiofile, sampling_rate=model.feature_extractor.sampling_rate)
language, language_probability, all_language_probs = model.detect_language(audio)
print(f"Detected language: {language}")

Note: by default this considers only the first 30-second segment of the audio. Pass language_detection_segments=SEGMENTS_NUM to model.detect_language to specify how many segments to consider.
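Putting the thread together for the original use case, a hedged sketch: detect the language once, then transcribe with either the fine-tuned or the general model. The model names, paths, and the target language code here are placeholders, not from this thread:

from faster_whisper import WhisperModel
from faster_whisper.audio import decode_audio

general_model = WhisperModel("large-v3")                    # placeholder model size
finetuned_model = WhisperModel("/path/to/finetuned-model")  # hypothetical local path
TARGET_LANGUAGE = "de"                                      # hypothetical fine-tuned language

# Decode once at the sampling rate the feature extractor expects (16 kHz).
audio = decode_audio("audio.mp3", sampling_rate=general_model.feature_extractor.sampling_rate)

# Step 1: detect the language with the general model.
language, probability, _ = general_model.detect_language(audio)

# Step 2: transcribe with the fine-tuned model if the language matches, else the general one.
model = finetuned_model if language == TARGET_LANGUAGE else general_model
segments, info = model.transcribe(audio, language=language)
for segment in segments:
    print(segment.text)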
