This repository provides a script and recipe to train the Transformer model for TensorFlow on a Habana Gaudi device. For further information on performance, refer to the Habana Model Performance Data page.
For more information on training deep learning models using Gaudi, refer to developer.habana.ai.
- Model-References
- Model Overview
- Setup
- Training and Examples
- Evaluating BLEU Score
- Profile
- Supported Configuration
- Changelog
- Known Issues
The Transformer is a Neural Machine Translation (NMT) model which uses an attention mechanism to boost training speed and overall accuracy. The model was initially introduced in Attention Is All You Need. This implementation is based on the Tensor2Tensor implementation (authors: Google Inc., Artit Wangperawong).
There are three model variants available: tiny, base, and big (selected via the transformer_tiny, transformer_base, and transformer_big hparams sets).
The Transformer model uses the standard NMT encoder-decoder architecture. Unlike other NMT models, the Transformer does not use recurrent connections and operates on a fixed-size context window. The encoder stack is made up of N identical layers. Each layer is composed of the following sub-layers:
- Self-attention layer
- Feedforward network (which is 2 fully-connected layers)
The decoder stack is also made up of N identical layers. Each layer is composed of the following sub-layers:
- Self-attention layer
- Multi-headed attention layer combining encoder outputs with results from the previous self-attention layer.
- Feedforward network (2 fully-connected layers)
The encoder uses self-attention to compute a representation of the input sequence. The decoder generates the output sequence one token at a time, taking the encoder output and the previously generated tokens as inputs. The model also applies embeddings to the input and output tokens and adds a constant positional encoding, which injects information about the position of each token.
The complete description of the Transformer architecture can be found in the Attention Is All You Need paper.
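As an illustration of the constant positional encoding described above, here is a minimal NumPy sketch of the sinusoidal formula from the paper (illustrative only, not code from this repository; the function name and the use of d_model=1024, the hidden size of the big variant, are assumptions):
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000**(2i / d_model)), PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    positions = np.arange(max_len)[:, np.newaxis]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (max_len, d_model/2)
    encoding = np.zeros((max_len, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding                                          # added to the token embeddings

# Example: encodings for a 128-token sequence at the big variant's hidden size.
print(sinusoidal_positional_encoding(128, 1024).shape)       # (128, 1024)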
Please follow the instructions provided in the Gaudi Installation Guide to set up the environment, including the $PYTHON environment variable. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform guide. These guides will walk you through the process of setting up your system to run the model on Gaudi.
In the Docker container, clone this repository and switch to the branch that matches your SynapseAI version. You can run the hl-smi utility to determine the SynapseAI version.
git clone -b [SynapseAI version] https://github.com/HabanaAI/Model-References /root/Model-References
Note: If the Model-References repository path is not in PYTHONPATH, make sure to update it:
export PYTHONPATH=$PYTHONPATH:/root/Model-References
Go to the Transformer directory and generate the dataset. The following script will save the dataset to /data/tensorflow/wmt32k_packed/train:
cd Model-References/TensorFlow/nlp/transformer/
$PYTHON datagen.py \
--data_dir=/data/tensorflow/wmt32k_packed/train \
--tmp_dir=/tmp/transformer_datagen \
--problem=translate_ende_wmt32k_packed \
--random_seed=429459
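As an optional sanity check (assuming the default data_dir above), you can list the generated files; datagen typically produces TFRecord shards named after the problem plus a vocabulary file:
ls /data/tensorflow/wmt32k_packed/train | head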
- In the Docker container, go to the Transformer directory:
cd /root/Model-References/TensorFlow/nlp/transformer
- Install the required packages using pip:
$PYTHON -m pip install -r requirements.txt
NOTE: All training examples for 1 HPU and 8 HPUs are valid for both first-gen Gaudi and Gaudi2.
Run training on 1 HPU:
$PYTHON trainer.py \
--data_dir=<path_to_dataset>/train \
--problem=translate_ende_wmt32k_packed \
--model=transformer \
--hparams_set=transformer_<model_size> \
--hparams=batch_size=<batch_size> \
--output_dir=<path_to_output_dir> \
--local_eval_frequency=<eval_frequency> \
--train_steps=<train_steps> \
--schedule=train \
--use_hpu=True \
--use_bf16=<use_bf16>
Run training on 1 HPU, batch size 4096, bfloat16, transformer_big, 300k steps with a checkpoint saved every 10k steps, last 10 checkpoints kept:
$PYTHON trainer.py \
--data_dir=/data/tensorflow/wmt32k_packed/train/ \
--problem=translate_ende_wmt32k_packed \
--model=transformer \
--hparams_set=transformer_big \
--hparams=batch_size=4096 \
--output_dir=./translate_ende_wmt32k_packed/transformer_big/bs4096 \
--local_eval_frequency=10000 \
--keep_checkpoint_max=10 \
--train_steps=300000 \
--schedule=train \
--use_hpu=True \
--use_bf16=True
For Gaudi2, the training batch size can be increased for better performance:
$PYTHON trainer.py \
--data_dir=/data/tensorflow/wmt32k_packed/train/ \
--problem=translate_ende_wmt32k_packed \
--model=transformer \
--hparams_set=transformer_big \
--hparams=batch_size=16384,learning_rate_constant=5.0,learning_rate_warmup_steps=5000 \
--output_dir=./translate_ende_wmt32k_packed/transformer_big/bs16384 \
--local_eval_frequency=2500 \
--keep_checkpoint_max=10 \
--train_steps=75000 \
--schedule=train \
--use_hpu=True \
--use_bf16=True
Run training on 8 HPUs:
NOTE: The mpirun --map-by PE attribute value may vary depending on your setup. For the recommended calculation, refer to the instructions detailed in mpirun Configuration.
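One rough way to derive a starting PE value on Linux is sketched below. The formula used here (total physical cores divided by the number of local processes) is an illustrative assumption, not the documented calculation, so verify the result against the mpirun Configuration guide:
# Heuristic sketch only: PE ~= physical cores / local processes.
PHYSICAL_CORES=$(lscpu -b -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
NUM_LOCAL_PROCESSES=8
echo "--map-by socket:PE=$((PHYSICAL_CORES / NUM_LOCAL_PROCESSES))"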
Run training on 8 HPUs, global batch size 8 * 4096, bfloat16, transformer_big, 300k steps with a checkpoint saved every 10k steps, last 10 checkpoints kept, learning rate constant 2.5:
mpirun \
--allow-run-as-root --bind-to core --map-by socket:PE=6 --np 8 \
--tag-output --merge-stderr-to-stdout \
$PYTHON trainer.py \
--data_dir=/data/tensorflow/wmt32k_packed/train/ \
--problem=translate_ende_wmt32k_packed \
--model=transformer \
--hparams_set=transformer_big \
--hparams=batch_size=4096,learning_rate_constant=2.5 \
--output_dir=./translate_ende_wmt32k_packed/transformer_big/bs4096 \
--local_eval_frequency=10000 \
--keep_checkpoint_max=10 \
--train_steps=300000 \
--schedule=train \
--use_horovod=True \
--use_hpu=True \
--use_bf16=True
For Gaudi2, the training batch size can be increased for better performance:
mpirun \
--allow-run-as-root --bind-to core --map-by socket:PE=6 --np 8 \
--tag-output --merge-stderr-to-stdout \
$PYTHON trainer.py \
--data_dir=/data/tensorflow/wmt32k_packed/train/ \
--problem=translate_ende_wmt32k_packed \
--model=transformer \
--hparams_set=transformer_big \
--hparams=batch_size=16384,learning_rate_constant=5.0,learning_rate_warmup_steps=5000 \
--output_dir=./translate_ende_wmt32k_packed/transformer_big/bs16384 \
--local_eval_frequency=2500 \
--keep_checkpoint_max=10 \
--train_steps=75000 \
--schedule=train \
--use_horovod=True \
--use_hpu=True \
--use_bf16=True
To run training on multiple servers, make sure to set the MULTI_HLS_IPS environment variable to the IPs of the servers being used.
NOTE: Multi-server training is supported only on first-gen Gaudi.
Run training on 16 HPUs:
export MULTI_HLS_IPS=192.10.100.174,10.10.100.101
mpirun \
--allow-run-as-root --bind-to core --map-by socket:PE=6 --np 8 \
--tag-output --merge-stderr-to-stdout \
$PYTHON trainer.py \
--data_dir=/data/tensorflow/wmt32k_packed/train/ \
--problem=translate_ende_wmt32k_packed \
--model=transformer \
--hparams_set=transformer_big \
--hparams=batch_size=4096,learning_rate_constant=3.0 \
--output_dir=./translate_ende_wmt32k_packed/transformer_big/bs4096 \
--local_eval_frequency=50000 \
--train_steps=150000 \
--schedule=train \
--use_horovod=True \
--use_hpu=True \
--use_bf16=True
Run training on 32 HPUs:
NOTE: It is recommended to use learning_rate_constant=3.5 and train_steps=75000.
export MULTI_HLS_IPS=192.10.100.174,10.10.100.101,10.10.100.102,10.10.100.103
mpirun \
--allow-run-as-root --bind-to core --map-by socket:PE=6 --np 8 \
--tag-output --merge-stderr-to-stdout \
$PYTHON trainer.py \
--data_dir=/data/tensorflow/wmt32k_packed/train/ \
--problem=translate_ende_wmt32k_packed \
--model=transformer \
--hparams_set=transformer_big \
--hparams=batch_size=4096,learning_rate_constant=3.5 \
--output_dir=./translate_ende_wmt32k_packed/transformer_big/bs4096 \
--local_eval_frequency=50000 \
--train_steps=75000 \
--schedule=train \
--use_horovod=True \
--use_hpu=True \
--use_bf16=True
After training the model, you can evaluate the achieved BLEU score:
- Download and tokenize the validation file:
sacrebleu -t wmt14 -l en-de --echo src > wmt14.src
cat wmt14.src | sacremoses tokenize -l en > wmt14.src.tok
- Compute BLEU score of a single checkpoint:
$PYTHON decoder.py \
--problem=translate_ende_wmt32k_packed \
--model=transformer \
--hparams_set=transformer_big \
--data_dir=<path_to_dataset>/train \
--output_dir=<path_to_output_dir> \
--checkpoint_path=<path_to_checkpoint> \
--use_hpu=True \
--decode_from_file=./wmt14.src.tok \
--decode_to_file=./wmt14.tgt.tok \
--decode_hparams=log_results=False
cat wmt14.tgt.tok | sacremoses detokenize -l de | sacrebleu -t wmt14 -l en-de
- Optional: To split the BLEU calculation across multiple cards, run decoder.py through mpirun. For example:
mpirun \
--allow-run-as-root --bind-to core --map-by socket:PE=6 --np 8 \
--tag-output --merge-stderr-to-stdout \
$PYTHON decoder.py \
--problem=translate_ende_wmt32k_packed \
--model=transformer \
--hparams_set=transformer_big \
--data_dir=<path_to_dataset>/train \
--output_dir=<path_to_output_dir> \
--checkpoint_path=<path_to_checkpoint> \
--decode_from_file=./wmt14.src.tok \
--decode_to_file=./wmt14.tgt.tok \
--use_hpu=True \
--use_horovod=True \
--decode_hparams=log_results=False
cat wmt14.tgt.tok | sacremoses detokenize -l de | sacrebleu -t wmt14 -l en-de
NOTE: The mpirun --map-by PE attribute value may vary depending on your setup. For the recommended calculation, refer to the instructions detailed in mpirun Configuration.
To run with profiling enabled, pass the --profile_steps flag. It takes a comma-separated pair of numbers specifying the steps on which profiling starts and ends. Profiler steps are counted individually for each run. Thus, if you run training for 100 steps with --profile_steps 99,100, profiling will always be enabled for the last two steps, regardless of the global_step_count.
Run training on 1 HPU with profiler:
$PYTHON trainer.py \
--data_dir=/data/tensorflow/wmt32k_packed/train/ \
--problem=translate_ende_wmt32k_packed \
--model=transformer \
--hparams_set=transformer_big \
--hparams=batch_size=4096 \
--output_dir=./translate_ende_wmt32k_packed/transformer_big/bs4096 \
--local_eval_frequency=10000 \
--train_steps=100 \
--schedule=train \
--use_hpu=True \
--profile_steps 50,53
The above example will produce a profile trace for 4 steps (50, 51, 52, 53).
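One way to inspect the resulting trace is TensorBoard's profile plugin; the log directory below is an assumption, so point --logdir at wherever the trace files were actually written:
tensorboard --logdir=./translate_ende_wmt32k_packed/transformer_big/bs4096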
| Validated on | SynapseAI Version | TensorFlow Version(s) | Mode |
|---|---|---|---|
| Gaudi | 1.13.0 | 2.13.1 | Training |
| Gaudi2 | 1.13.0 | 2.13.1 | Training |
- Model enabled on Gaudi2, with the same config as first-gen Gaudi.
- Added profiling support.
- Enabled experimental variable clustering to improve performance.
- Removed advanced parameters section from README.
- Replaced references to the custom demo script with community entry points in the README.
- Added support to import the horovod-fork package directly instead of using Model-References' TensorFlow.common.horovod_helpers; wrapped the horovod import in a try-catch block so that users are not required to install this library when the model is run on a single card.
- Updated requirements.txt.
- Changed the default value of the log_step_count_steps flag.
- Enabled multi-HPU BLEU calculation.
- Updated requirements.txt.
- Added support for recipe cache; see TF_RECIPE_CACHE_PATH in the HabanaAI documentation for details.
- Enabled multi-server training.
- Removed support for models other than Transformer.
- Added support for Horovod together with some adjustments in the topology script to allow simplifying the computational graph.
Only FP32 precision is supported when calculating BLEU on HPU.