This directory provides scripts to train the GPT-based BLOOM 13B and LLaMA models in the Megatron-DeepSpeed repository.
- Model-References
- Model Overview
- Setup
- Training and Examples
- Supported Configuration
- Changelog
- Known Issues
This implementation is based on https://github.com/microsoft/Megatron-DeepSpeed at 0c58dbb. Megatron (1 and 2) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. This repository is used for training large transformer language models, such as BLOOM, at scale. The codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. LLaMA training is based on https://arxiv.org/abs/2302.13971.
Users bear sole liability and responsibility to follow and comply with any third party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use or compliance with third party licenses.
Please follow the instructions provided in the Gaudi Installation Guide to set up the environment, including the $PYTHON environment variable. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform guide. These guides will walk you through the process of setting up your system to run the model on Gaudi2.
Please follow the instructions provided in the DeepSpeed Installation Guide to install deepspeed-fork.
In the docker container, clone this repository and switch to the branch that matches your SynapseAI version. You can run the hl-smi utility to determine the SynapseAI version.
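If you are unsure which branch to check out, a minimal, hedged way to inspect the installed release is to filter the hl-smi header for its version fields (the exact field names may vary between releases):

hl-smi | grep -i version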
git clone -b [SynapseAI version] https://github.com/HabanaAI/Model-References
export MODEL_REFERENCES_ROOT=/path/to/Model-References
export PYTHONPATH=/path/to/Model-References/PyTorch/common:$PYTHONPATH
- In the docker container, go to the model directory:
cd Model-References/PyTorch/nlp/DeepSpeedExamples/Megatron-DeepSpeed/
- Install the required packages using pip:
pip install -r requirements.txt
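As a quick sanity check that deepspeed-fork (installed earlier) is visible in this environment, the following hedged one-liner prints the installed DeepSpeed version via the standard package attribute:

$PYTHON -c "import deepspeed; print(deepspeed.__version__)"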
Follow the instructions at https://github.com/bigscience-workshop/bigscience/tree/master/data/oscar to download the full oscar-en dataset. Note that the dataset takes around 550 GB of disk space. This dataset is used for training both BLOOM and LLaMA.
The steps below prepare the dataset for training. They are based on the instructions at https://github.com/bigscience-workshop/bigscience/tree/master/data/oscar.
git clone https://github.com/bigscience-workshop/bigscience.git
cd bigscience/data/oscar
# In the Python file below, edit the language_subsets list: the default is unshuffled_deduplicated_en;
# to prepare the Chinese subset used in the next steps, add unshuffled_deduplicated_zh and comment out unshuffled_deduplicated_en
vi oscar-to-jsonl.py
# -s can be added to process only a subset of the data
$PYTHON oscar-to-jsonl.py
mkdir -p zh
mv oscar*.jsonl zh
cd zh
cat oscar-[0-4].jsonl > oscar-zh.jsonl
$PYTHON $MODEL_REFERENCES_ROOT/PyTorch/nlp/DeepSpeedExamples/Megatron-DeepSpeed/tools/preprocess_data.py --input oscar-zh.jsonl --output-prefix $MODEL_REFERENCES_ROOT/PyTorch/nlp/DeepSpeedExamples/Megatron-DeepSpeed/zh/tokenized --vocab-file gpt2-vocab.json --merge-file gpt2-merges.txt --append-eod --tokenizer-type GPT2BPETokenizer --workers 64
# the tokenized files produced by this step are used for training
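The preprocessing step writes an indexed binary dataset under the --output-prefix path. Assuming the standard Megatron naming convention (prefix plus the JSON key, here `text`), the output should look roughly like the sketch below; verify the exact file names on your system:

ls $MODEL_REFERENCES_ROOT/PyTorch/nlp/DeepSpeedExamples/Megatron-DeepSpeed/zh/
# tokenized_text_document.bin  tokenized_text_document.idx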
Training of the BLOOM 13B model is based on https://github.com/bigscience-workshop/bigscience/blob/master/train/tr1-13B-base/tr1-13B-round1.slurm. Training of LLaMA is based on https://arxiv.org/abs/2302.13971, and training of LLaMA 2 is based on https://arxiv.org/pdf/2307.09288.
- Update the data root directory with the path of your choice:
HL_DATA_DIR_ROOT=/data/bigscience/oscar-en
- Run BLOOM on 8 HPUs with BF16 precision:
HL_NUM_NODES=1 HL_PP=1 HL_TP=4 HL_DP=2 scripts/run_bloom13b.sh
- Run BLOOM on 32 HPUs with BF16 precision (Note: make sure to change the IP addresses in the hostsfile according to your setup; see the hostsfile sketch after these examples):
HL_HOSTSFILE=scripts/hostsfile HL_NUM_NODES=4 HL_PP=2 HL_TP=4 HL_DP=4 scripts/run_bloom13b.sh
- Run BLOOM on 64 HPUs with BF16 precision (Note: make sure to change the IP addresses in the hostsfile according to your setup):
HL_HOSTSFILE=scripts/hostsfile HL_NUM_NODES=8 HL_PP=2 HL_TP=2 HL_DP=16 scripts/run_bloom13b.sh
- Run LLaMA 13B on 8 HPUs with BF16 precision:
HL_NUM_NODES=1 HL_PP=2 HL_TP=4 HL_DP=1 scripts/run_llama13b.sh
- Run LLaMA 13B on 64 HPUs with BF16 precision (Note: make sure to change the IP addresses in the hostsfile according to your setup):
HL_HOSTSFILE=scripts/hostsfile HL_NUM_NODES=8 HL_PP=2 HL_TP=2 HL_DP=16 scripts/run_llama13b.sh
- Run LLaMA 2 70B on 256 HPUs with BF16 precision (Note: make sure to change the IP addresses in the hostsfile according to your setup):
HL_HOSTSFILE=scripts/hostsfile HL_CKP_ACT=2 HL_MICRO_BATCH=1 HL_NUM_NODES=32 HL_PP=8 HL_TP=8 HL_DP=4 scripts/run_llamav2.sh
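The multi-node commands above read the node list from scripts/hostsfile, which follows the standard DeepSpeed hostsfile format: one line per node with the number of devices it contributes as `slots`. A minimal sketch for the 4-node, 32-HPU BLOOM run (the IP addresses are placeholders; replace them with your own):

# scripts/hostsfile -- example with placeholder IPs
10.10.100.101 slots=8
10.10.100.102 slots=8
10.10.100.103 slots=8
10.10.100.104 slots=8

Note that in each example the 3D-parallelism degrees multiply out to the total device count, i.e. HL_PP x HL_TP x HL_DP = HL_NUM_NODES x 8 (for instance, 2 x 2 x 16 = 8 x 8 = 64 for the 64-HPU runs).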
| Validated on | SynapseAI Version | PyTorch Version | Mode |
|---|---|---|---|
| Gaudi2 | 1.13.0 | 2.1.0 | Training |
- Added support for LLaMA 2 70B
- Added support for FusedSDPA
- Added support for Sequence Parallelism
- Updated the recommended 3D-parallelism configuration for BLOOM & LLaMA.
- Updated the recommended 3D-parallelism configuration for BLOOM.
- Added support for LLaMA.
- Initial release.
Major changes made to the original model from the microsoft/Megatron-DeepSpeed repository:
- Changed README file content.
- Replaced CUDA specific API calls with generic ones.
- Switched GPT default optimizer from Adam to AdamW.
- Added support for universal checkpoint based on universal checkpoint support in Bloom.
- Added kill-switch mechanism to gracefully stop training based on support in Bloom.
- Added HPU memory logging.
- Only scripts and configurations mentioned in this README are supported and verified.