Tweak Pipeline Parallel layer split strategy #20

C-TC · 2024-09-03T14:27:15Z

The previous pipeline block split strategy in nanotron tries to split the layers evenly across devices in terms of FLOPs. However, as embedding tables keep growing, memory consumption should also be taken into account.

We follow the practice used in Llama 405B training, which treats the embedding table and the LM head each as being as heavy as a single transformer block. (This is why the 405B model has 126 layers: with these two extra units it splits like a 128-layer model.)

Any suggestions for a better general strategy?

e.g. a 32-layer llama model and PP=4:
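A minimal sketch of how such a split could be computed, assuming every transformer block, the embedding table, and the LM head each cost one unit. The function name `split_layers`, its parameters, and the midpoint-based contiguous partition are illustrative assumptions, not nanotron's actual implementation:

```python
from typing import List


def split_layers(
    num_blocks: int,
    pp_size: int,
    embed_cost: float = 1.0,  # assumption: embedding weighs as much as one block
    head_cost: float = 1.0,   # assumption: LM head weighs as much as one block
) -> List[List[str]]:
    """Assign model units to pipeline stages so per-stage cost is roughly equal.

    Units are kept in model order, so each stage holds a contiguous slice.
    """
    names = ["embedding"] + [f"block_{i}" for i in range(num_blocks)] + ["lm_head"]
    costs = [embed_cost] + [1.0] * num_blocks + [head_cost]
    total = sum(costs)

    stages: List[List[str]] = [[] for _ in range(pp_size)]
    running = 0.0  # cumulative cost of all units placed so far
    for name, cost in zip(names, costs):
        # Pick the stage whose cost interval contains this unit's midpoint.
        midpoint = running + cost / 2
        stage = min(int(midpoint * pp_size / total), pp_size - 1)
        stages[stage].append(name)
        running += cost
    return stages


def blocks_per_stage(stages: List[List[str]]) -> List[int]:
    """Count only transformer blocks on each stage."""
    return [sum(n.startswith("block_") for n in s) for s in stages]


if __name__ == "__main__":
    stages = split_layers(num_blocks=32, pp_size=4)
    print(blocks_per_stage(stages))  # [7, 9, 8, 8]
```

In this sketch the first stage holds the embedding plus 7 blocks and the last stage holds 8 blocks plus the LM head, so the stages carry 8, 9, 8, and 9 units. Setting `embed_cost` and `head_cost` to 0 makes the sketch fall back to an even 8/8/8/8 split of the blocks.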

eliebak and others added 5 commits August 27, 2024 15:22

3a45a34 remove torch compile

3340a7b only log ckpt on rank 0

6cf2d63 Merge pull request huggingface#224 from eliebak/fix-logging-checkpoint (log "No checkpoint path provided" only on rank 0)

4a2ddca Merge pull request huggingface#223 from eliebak/fix-torch-compile (remove torch compile)

964ceae fix attempt 1: treat embedding as heavy as transformer layer

C-TC mentioned this pull request Sep 3, 2024

Get 70b in our fork working with pp4, tp4, dp>1 #19

Open
