This folder contains scripts to pre-train the HuBERT model on the Habana Gaudi device. To obtain model performance data, refer to the Habana Model Performance Data page.
For more information about training deep learning models using Gaudi, visit developer.habana.ai.
- Model-References
- Model Overview
- Setup
- Pre-training and Examples
- Supported Configurations
- Script Modifications
This directory contains a sample implementation of the pre-training pipeline based on HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. This version is based on PyTorch Audio v2.0.1.
Please follow the instructions provided in the Gaudi Installation Guide to set up the environment, including the $PYTHON environment variable. The guide will walk you through the process of setting up your system to run the model on Gaudi.
In the Docker container, clone this repository and switch to the branch that matches your SynapseAI version. You can run the hl-smi utility to determine the SynapseAI version.
git clone -b [SynapseAI version] https://github.com/HabanaAI/Model-References
Note: If the repository is not in the PYTHONPATH, make sure to update it by running the command below.
export PYTHONPATH=/path/to/Model-References:$PYTHONPATH
- In the docker container, go to the model directory:
cd Model-References/PyTorch/audio/hubert
- Install the required packages using pip.
pip install -r requirements.txt
Download the LibriSpeech dataset:
python dataset_itr1.py --download_dir=/scratch1/hubert/itr1_data --librispeech_split 960 --cleanup
NOTE:
- The above command generates the required data under /scratch1/hubert/itr1_data/960_extract.
- Rename /scratch1/hubert/itr1_data/960_extract to /data/pytorch/wav2vec/data/LibriSpeech/train-960/, which is used as the --root-dir in the pre-processing example commands.
The environment variables PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST and PT_HPU_AUTOCAST_FP32_OPS_LIST enable Autocast on Gaudi. They are set to default paths when executing the script, and can be overridden by providing custom op lists.
The following are the default Autocast Ops files:
- PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST: ./ops_bf16_hubert.txt
- PT_HPU_AUTOCAST_FP32_OPS_LIST: ./ops_fp32_hubert.txt
For further details, refer to PyTorch Mixed Precision.
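For illustration, the following is a minimal sketch of how the op lists and Autocast fit together, assuming the Habana PyTorch bridge (habana_frameworks.torch) is installed and an HPU device is available; it is not part of the training script.

```python
# A minimal sketch, not the training script itself. Assumes the Habana
# PyTorch bridge (habana_frameworks.torch) is installed and an HPU is
# available. Set the op-list variables before the HPU backend is
# imported so the custom lists take effect.
import os
os.environ.setdefault("PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST", "./ops_bf16_hubert.txt")
os.environ.setdefault("PT_HPU_AUTOCAST_FP32_OPS_LIST", "./ops_fp32_hubert.txt")

import torch
import habana_frameworks.torch.core  # noqa: F401  (registers the "hpu" device)

x = torch.randn(8, 16, device="hpu")
w = torch.randn(16, 16, device="hpu")
with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
    y = x @ w  # runs in BF16 if matmul is in the lower-precision ops list
print(y.dtype)
```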
The Base architecture of the HuBERT model requires two iterations of pre-training.
preprocess.py generates the file lists of training and validation data, trains a KMeans clustering model on either MFCC features or a transformer layer's output from the pre-trained HuBERT model, and then predicts the cluster IDs for each utterance as the labels for masked prediction training.
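For intuition, the labeling step boils down to the following hypothetical, single-utterance sketch using torchaudio and scikit-learn; preprocess.py is the actual implementation and operates over the generated file lists.

```python
# Hypothetical, single-utterance sketch of the KMeans labeling idea;
# "utterance.flac" is a placeholder path, not a file from this repo.
import torchaudio
from sklearn.cluster import MiniBatchKMeans

waveform, sample_rate = torchaudio.load("utterance.flac")
mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=13)(waveform)
frames = mfcc.squeeze(0).transpose(0, 1).numpy()      # (num_frames, 13)

kmeans = MiniBatchKMeans(n_clusters=100).fit(frames)  # cf. --num-cluster 100
labels = kmeans.predict(frames)                       # one cluster ID per frame
print(labels[:10])
```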
To use MFCC features to train the KMeans model, see the following example:
python preprocess.py --root-dir /data/pytorch/wav2vec/data/LibriSpeech/train-960/ --feat-type mfcc --exp-dir /data/pytorch/hubert/exp --num-cluster 100
NOTE:
- Please make sure the first line in /data/pytorch/hubert/exp/data/mfcc/tsv/librispeech_train.tsv and /data/pytorch/hubert/exp/data/mfcc/tsv/librispeech_valid.tsv points to the correct directory, /data/pytorch/wav2vec/data/LibriSpeech/train-960/.
- It is assumed that the above LibriSpeech dataset is available at path /data/pytorch/wav2vec/data/LibriSpeech/train-960/.
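To check the tsv root quickly, a small helper along these lines can be used (a sketch, assuming the tsv layout where the first line is the dataset root directory):

```python
# Hypothetical helper: in this tsv layout the first line is the dataset
# root directory; each following line is a relative audio path entry.
from pathlib import Path

def check_tsv_root(tsv_path: str, expected_root: str) -> None:
    first_line = Path(tsv_path).read_text().splitlines()[0]
    if first_line.rstrip("/") != expected_root.rstrip("/"):
        print(f"{tsv_path}: root is {first_line!r}, expected {expected_root!r}")

for split in ("train", "valid"):
    check_tsv_root(f"/data/pytorch/hubert/exp/data/mfcc/tsv/librispeech_{split}.tsv",
                   "/data/pytorch/wav2vec/data/LibriSpeech/train-960/")
```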
train.py trains a HuBERTPretrainModel using PyTorch Lightning.
The first iteration is trained for ~250k steps on 8 cards; each rank has at most 175 seconds of audio in a mini-batch.
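This per-rank cap corresponds to duration-based batching rather than a fixed batch size. The following hypothetical sketch shows the idea; the real dataloader additionally aligns input shapes to buckets (see --num-buckets and --align-buckets):

```python
# Hypothetical sketch of duration-capped batching: utterances are packed
# greedily until adding the next one would exceed the per-rank budget,
# mirroring the idea behind --seconds-per-batch.
from typing import Iterable, Iterator, List, Tuple

def batch_by_seconds(utterances: Iterable[Tuple[str, float]],
                     max_seconds: float = 175.0) -> Iterator[List[str]]:
    batch: List[str] = []
    total = 0.0
    for path, seconds in utterances:
        if batch and total + seconds > max_seconds:
            yield batch
            batch, total = [], 0.0
        batch.append(path)
        total += seconds
    if batch:
        yield batch
```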
NOTE: OMP_NUM_THREADS value may vary on your setup. For the recommended calculation, refer to the instructions detailed in mpirun Configuration.
- Run on Gaudi2 with BF16 on 8 cards:
PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST=ops_bf16_hubert.txt PT_HPU_AUTOCAST_FP32_OPS_LIST=ops_fp32_hubert.txt OMP_NUM_THREADS=10 PT_RECIPE_CACHE_PATH="/tmp/iter1_recipe_cache/" PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES=0 python -m torch.distributed.run --nnodes=1 --nproc_per_node=8 --rdzv_id=JOB_ID --rdzv_backend=c10d train.py --dataset-path /data/pytorch/hubert/exp/data/mfcc/ --exp-dir /root/scratch1/exp_iter1_cache --feature-type mfcc --num-class 100 --max-updates 245594 --learning-rate 0.0005 --hpus 8 --num-nodes 1 --num-buckets=10 --align-buckets="bucket1" --accumulate-grad-batches 1 --all-static --use-conv2d --use-instancenorm --use-fused-clip --optimizer fusedadamw --warmup-updates 31436 --autocast --seconds-per-batch 350 --use-max-sub-softmax-opt --recompilation-optimization 2>&1 | tee /root/scratch1/log_8x_hpu_bf16_iter1.txt
After the first iteration of pre-training, the output of an intermediate transformer layer of the pre-trained HuBERTPretrainModel can be used to train a new KMeans clustering model. The new KMeans model then generates new cluster labels for the second iteration of masked prediction training.
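For illustration only, extracting an intermediate layer's output with torchaudio can be sketched as follows; this uses a published torchaudio HuBERT for brevity, whereas preprocess.py loads the iteration-1 checkpoint passed via --checkpoint-path:

```python
# Hypothetical sketch of pulling an intermediate transformer layer's
# output as KMeans features; "utterance.flac" is a placeholder path.
import torch
import torchaudio

model = torchaudio.pipelines.HUBERT_BASE.get_model().eval()
waveform, sample_rate = torchaudio.load("utterance.flac")  # expects 16 kHz audio

with torch.inference_mode():
    # extract_features returns one tensor per transformer layer,
    # truncated to the first `num_layers` layers.
    features, _ = model.extract_features(waveform, num_layers=6)

layer6 = features[-1]  # (batch, num_frames, 768): the 6th layer's output
```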
- To use the 6th transformer layer's output as the input feature for training the KMeans model, see the following example:
python preprocess.py --root-dir /data/pytorch/wav2vec/data/LibriSpeech/train-960/ --feat-type hubert --exp-dir /data/pytorch/hubert/exp --layer-index 6 --num-rank 40 --checkpoint-path exp_iter1/checkpoints_librispeech_hubert_pretrain_base/epoch=220-step=241553.ckpt --num-cluster 500 --percent 0.1 2>&1 | tee log_preprocess_2_1.txt
NOTE:
- Please make sure the first line in /data/pytorch/hubert/exp/data/hubert_6/tsv/librispeech_train.tsv and /data/pytorch/hubert/exp/data/hubert_6/tsv/librispeech_valid.tsv points to the correct directory, /data/pytorch/wav2vec/data/LibriSpeech/train-960/.
- It is assumed that the above LibriSpeech dataset is available at path /data/pytorch/wav2vec/data/LibriSpeech/train-960/.
The second iteration is trained for ~400k steps.
NOTE: OMP_NUM_THREADS value may vary on your setup. For the recommended calculation, refer to the instructions detailed in mpirun Configuration.
- Run on Gaudi2 with BF16 on 8 cards:
PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST=ops_bf16_hubert.txt PT_HPU_AUTOCAST_FP32_OPS_LIST=ops_fp32_hubert.txt OMP_NUM_THREADS=10 PT_RECIPE_CACHE_PATH="/tmp/recipe_cache_iter2/" PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES=0 python -m torch.distributed.run --nnodes=1 --nproc_per_node=8 --rdzv_id=JOB_ID --rdzv_backend=c10d train.py --dataset-path /data/pytorch/hubert/exp/data/hubert_6/ --exp-dir /root/scratch1/exp_iter2_cache --feature-type hubert --num-class 500 --max-updates 395120 --learning-rate 0.0005 --hpus 8 --num-nodes 1 --num-buckets=10 --align-buckets="bucket1" --accumulate-grad-batches 2 --all-static --use-conv2d --use-instancenorm --use-fused-clip --optimizer fusedadamw --warmup-updates 31610 --seconds-per-batch 175 --autocast --use-max-sub-softmax-opt --recompilation-optimization --split-logits 2>&1 | tee /root/scratch1/log_8x_hpu_2nditer.txt
Training: HuBERT Pretrain 1 and Pretrain 2
Validated on | SynapseAI Version | PyTorch Lightning Version | Torch Audio Version | PyTorch Version | Mode |
---|---|---|---|---|---|
Gaudi2 | 1.10.0 | 2.0.0 | 2.0.1 | 2.0.1 | Training |
- Aligned input shapes with bucket boundaries to reduce input dynamicity, and adjusted the training/warmup steps to account for the dataloader change introduced by bucketing.
- Removed dynamic boolean indexing in the mask generator, encoder, and _compute_logits.
- Used torch.where to implement a static LayerDrop (see the sketch after this list).
- Made the loss computation static by combining the masked and unmasked accuracy calculations.
- Used FusedAdamW and FusedClipNorm.
- Replaced GroupNorm with InstanceNorm and conv1d with conv2d in the feature extractor, and added a checkpoint option for the 1d/2d variants.
- Added a log interval option and reduced the logging frequency.
- Added mixed precision and Autocast support.
- Zeroed out gradients by setting them to None.
- Added conv2d-to-conv1d parameter conversion for the second pre-processing step.
- Added the max-sub-softmax optimization.
- Added recipe cache sharing across ranks.
- Added distributed evaluation and increased batch sizes.
- Updated to support PyTorch 2.0 and torchaudio v2.0.1.
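As a minimal, hypothetical sketch of the torch.where-based static LayerDrop mentioned above:

```python
# Hypothetical sketch of static LayerDrop: the layer always executes,
# and torch.where selects between its output and the untouched input,
# so the compiled graph is identical whether the layer is kept or
# "dropped" (no data-dependent Python branching).
import torch

def static_layer_drop(x: torch.Tensor, layer: torch.nn.Module,
                      p: float, training: bool) -> torch.Tensor:
    y = layer(x)
    if not training or p == 0.0:
        return y
    keep = torch.rand((), device=x.device) >= p   # scalar bool tensor
    return torch.where(keep, y, x)                # broadcasts over y and x
```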