Build model for Vosk #1690
Comments
Looks ok
I want to create a speech recognition model for numbers from 0 to 13 in Arabic using Google Colab.
You can use our existing colab https://github.com/alphacep/vosk-api/blob/master/python/example/colab/vosk-training.ipynb
Are these steps correct after installing vosk-training.ipynb?
# Import necessary libraries
import os
# Install required libraries
!pip install --upgrade g2p-en pandas
# Mount Google Drive
print(" Mounting Google Drive...")
# Clone Vosk recipes
print(" Cloning Vosk recipes...")
# Prepare Vosk environment
try:
# Define paths
ROOT_DIR = "/content/drive/MyDrive/custom_vosk"
# Create directories
os.makedirs(WAV_DIR, exist_ok=True)
# Copy audio files
print(" Copying audio files...")
# Read and process transcripts file
print(" Reading transcripts file...")
# Create processed transcripts file
print("✍️ Creating processed transcripts file...")
# Text cleaning function
def clean_text(text):
# Verify audio files format
print(" Verifying audio files format...")
if audio_errors:
# Prepare lexicon file
print(" Preparing lexicon file...")
# Extract and clean words from transcripts
with open(TRANSCRIPT_TXT, "r", encoding="utf-8") as f:
# Create lexicon file
with open(LEXICON_TXT, "w", encoding="utf-8") as f, open(FAILED_WORDS_TXT, "w", encoding="utf-8") as failed:
print("✅ Lexicon file prepared.")
# Train the model
print(" Starting model training...")
# Evaluation steps
print(" Model evaluation steps:")
I want to create a speech recognition model for numbers from 0 to 13 in Arabic using Google Colab. Are these steps correct?
Installing Basic Tools
First, we need to install essential tools like build-essential, gfortran, and sox that are necessary for building tools in Kaldi and Vosk.
In the first cell in Google Colab, input the following commands:
!apt-get install -y build-essential gfortran
!apt-get install -y sox
!apt-get install -y python3-pip
!pip install kaldi-python
Installing Kaldi
After installing the basic tools, we will install Kaldi from GitHub using the following commands:
!git clone https://github.com/kaldi-asr/kaldi.git
%cd kaldi/tools
!make
%cd ../src
!./configure --use-cuda=no
!make
Next, we need to upload audio recordings of the numbers 0 to 13 and prepare the data files that Kaldi requires.
Upload Audio Files (Numbers 0 to 13)
Upload your audio files to Google Colab. You can upload files through Colab's "Files" interface.
Example of file paths (upload the files under the audio/ folder):
audio/zero.wav
audio/one.wav
audio/two.wav
...
audio/thirteen.wav
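Before going further it may help to confirm that every uploaded file has the format the training recipe expects. The sketch below uses Python's standard wave module. The target of 16 kHz, mono, 16-bit PCM is an assumption (a common choice for Kaldi recipes), and check_wav is a hypothetical helper name; adjust the defaults to whatever your recipe's feature configuration uses.

```python
import wave

def check_wav(path, rate=16000, channels=1, sample_width=2):
    """Return a list of mismatches between a WAV file and the expected format.

    The defaults (16 kHz, mono, 16-bit) are assumptions, not values taken
    from this thread.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(f"sample rate {wav.getframerate()} Hz, expected {rate}")
        if wav.getnchannels() != channels:
            problems.append(f"{wav.getnchannels()} channels, expected {channels}")
        if wav.getsampwidth() != sample_width:
            problems.append(
                f"{8 * wav.getsampwidth()}-bit samples, expected {8 * sample_width}"
            )
    return problems
```

Calling check_wav("audio/zero.wav") returns an empty list when the file matches. Files that do not match can be converted with sox, which is installed in the first step.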
Preparing the Required Files
We will create 3 main files: text, wav.scp, and utt2spk.
In the text file, we write each utterance ID followed by its transcript. Each utterance ID must be unique, like speaker-0, speaker-1, and so on.
The content will look like this:
speaker-0 صفر
speaker-1 واحد
speaker-2 اثنان
speaker-3 ثلاثة
speaker-4 أربعة
speaker-5 خمسة
speaker-6 ستة
speaker-7 سبعة
speaker-8 ثمانية
speaker-9 تسعة
speaker-10 عشرة
speaker-11 أحد عشر
speaker-12 اثنا عشر
speaker-13 ثلاثة عشر
The wav.scp file maps each utterance ID to the path of its audio file. For example:
speaker-0 /path/to/audio/zero.wav
speaker-1 /path/to/audio/one.wav
speaker-2 /path/to/audio/two.wav
speaker-3 /path/to/audio/three.wav
...
speaker-13 /path/to/audio/thirteen.wav
In the utt2spk file, we link each utterance ID to the speaker's name. In this case, it's always speaker.
speaker-0 speaker
speaker-1 speaker
speaker-2 speaker
speaker-3 speaker
...
speaker-13 speaker
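One detail to watch: Kaldi's data directory checks expect these files to be sorted by utterance ID, and with plain string sorting speaker-10 comes before speaker-2. The sketch below writes the three files in sorted order. The zero-padded IDs (speaker-02) and the helper names make_utt_id and write_data_dir are hypothetical, introduced here only for illustration.

```python
import os

def make_utt_id(speaker, index, width=2):
    # Hypothetical naming scheme: zero-pad so string order matches numeric order.
    return f"{speaker}-{index:0{width}d}"

def write_data_dir(utterances, speaker, out_dir):
    """Write text, wav.scp and utt2spk for a single speaker.

    utterances maps utterance ID -> (wav_path, transcript).
    Returns the utterance IDs in the order they were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    utt_ids = sorted(utterances)  # plain string (code point) order
    with open(os.path.join(out_dir, "text"), "w", encoding="utf-8") as text, \
         open(os.path.join(out_dir, "wav.scp"), "w", encoding="utf-8") as scp, \
         open(os.path.join(out_dir, "utt2spk"), "w", encoding="utf-8") as u2s:
        for utt_id in utt_ids:
            wav_path, transcript = utterances[utt_id]
            text.write(f"{utt_id} {transcript}\n")
            scp.write(f"{utt_id} {wav_path}\n")
            u2s.write(f"{utt_id} {speaker}\n")
    return utt_ids
```

Kaldi also uses a spk2utt file, which can be derived from utt2spk with utils/utt2spk_to_spk2utt.pl, and the finished directory can be checked with utils/validate_data_dir.sh.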
Creating the lexicon.txt File
In this file, you will need to write each word and its corresponding phonemes (in your case, the numbers 0 to 13). The phonemes should match how the Arabic numbers are actually pronounced in the audio files.
The content will look like this:
صفر s ˈf r
واحد w ʌ h i d
اثنان ʔ t h aː n
ثلاثة t h l aː t a
أربعة ʔ r b ʕ a
خمسة x aː m s a
ستة s i t t a
سبعة s aː b ʕ a
ثمانية t h m aː n iː a
تسعة t s ʕ a
عشرة ʕ ʃ a r a
أحد عشر ʔ h d ʔ aʃ a r
اثنا عشر ʔ t h n aʃ a r
ثلاثة عشر t l aʕ t aʃ a r
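A point worth checking here: in Kaldi's lexicon.txt format the first whitespace-separated field on a line is the word and everything after it is a phone. A two-word entry such as the one for 11 is therefore read as the single word أحد with عشر as its first "phone". The sketch below shows that parsing and lists transcript words that have no lexicon entry; parse_lexicon_line and missing_words are hypothetical helper names.

```python
def parse_lexicon_line(line):
    """Split a lexicon line the way the format is defined: word, then phones."""
    fields = line.split()
    return fields[0], fields[1:]

def missing_words(transcripts, lexicon_lines):
    """Return transcript words that do not appear as a lexicon entry."""
    known = {parse_lexicon_line(l)[0] for l in lexicon_lines if l.strip()}
    words = {w for t in transcripts for w in t.split()}
    return sorted(words - known)

word, phones = parse_lexicon_line("أحد عشر ʔ h d ʔ aʃ a r")
print(word, phones)
print(missing_words(["أحد عشر"], ["أحد عشر ʔ h d ʔ aʃ a r"]))
```

One way around this, under the same assumptions, is to give each word of a compound number its own lexicon line, so that عشر has an entry of its own.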
Creating nonsilence_phones.txt and silence_phones.txt
nonsilence_phones.txt: This contains all the non-silent phonemes. You can extract them from lexicon.txt:
cut -d ' ' -f 2- lexicon.txt | sed 's/ /\n/g' | sort -u > nonsilence_phones.txt
silence_phones.txt: This contains the silence phones. In this case, the content could be:
echo -e 'SIL\noov\nSPN' > silence_phones.txt
Preparing the Language Directory
The next step is preparing the data/lang directory by running prepare_lang.sh:
utils/prepare_lang.sh data/local/dict "" data/local/lang data/lang
Creating corpus.txt
This file should contain all the sentences you want to use in your dataset. You can simply extract sentences from the text file by using a script to remove the utterance-id.
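As a minimal sketch of that extraction, assuming the text file uses the "utterance-id sentence" layout shown earlier, the hypothetical helper text_to_corpus below drops the first field of each line and keeps the rest:

```python
def text_to_corpus(text_lines):
    """Strip the leading utterance ID from each line of a Kaldi text file."""
    corpus = []
    for line in text_lines:
        parts = line.strip().split(maxsplit=1)
        if len(parts) == 2:  # skip blank lines and IDs without a transcript
            corpus.append(parts[1])
    return corpus

def write_corpus(text_path, corpus_path):
    # Read the text file and write one sentence per line to corpus.txt.
    with open(text_path, encoding="utf-8") as src:
        sentences = text_to_corpus(src)
    with open(corpus_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(sentences) + "\n")
```

For example, write_corpus("data/train/text", "corpus.txt") would produce the corpus file; both paths are placeholders.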
Installing SRILM for Language Model Creation
First, download and extract SRILM:
!wget https://www.speech.sri.com/projects/srilm/srilm.tar.gz
!tar -xzvf srilm.tar.gz
Then, install SRILM:
!./install_srilm.sh && ./env.sh
After installing SRILM, run lm_creation.sh to create the language model:
./lm_creation.sh
Alignment of Data
Use the align_train.sh script to align the data:
./align_train.sh
Training the Model
After alignment, you can train the model using the run_tdnn_1j.sh script:
local/chain/tuning/run_tdnn_1j.sh
Preparing the Final Model
After training is complete, collect all the necessary files and prepare the model using the copy_final_result.sh script:
./copy_final_result.sh
Creating the model.conf File
You need to create the model.conf file to specify the model settings. For example:
--min-active=200
--max-active=3000
--beam=10.0
--lattice-beam=2.0
--acoustic-scale=1.0
--frame-subsampling-factor=3
--endpoint.silence-phones=1:2:3:4:5:6:7:8:9:10
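A quick sanity check on this file can catch typos before the model is loaded. The sketch below assumes every non-empty line has the --name=value form shown above; treating lines that start with # as comments is also an assumption, and parse_model_conf is a hypothetical helper, not part of Vosk.

```python
def parse_model_conf(lines):
    """Parse --name=value option lines into a dict of strings."""
    options = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # assumption: '#' marks a comment
            continue
        key, sep, value = line.partition("=")
        if not key.startswith("--") or not sep:
            raise ValueError(f"not a --name=value line: {line!r}")
        options[key[2:]] = value
    return options

conf = parse_model_conf([
    "--beam=10.0",
    "--frame-subsampling-factor=3",
])
print(conf)
```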
Now you have a model that is fully compatible with Vosk and can be used to recognize the numbers from 0 to 13.