
Optimized DeepSeek-v2 on Gaudi #1677
Merged (16 commits into huggingface:main on Jan 28, 2025)

Conversation

@gyou2021 (Contributor) commented Jan 6, 2025:

  1. Optimized MLA attention for inference;
  2. Optimized MoE with dynamic MoE and expert slicing for inference (a background sketch of top-k routing is included below);
  3. Optimized the cache for inference;
  4. Optimized training.
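As background for the MoE item above (and not the dynamic-MoE/expert-slicing implementation this PR adds on Gaudi), here is a minimal sketch of plain top-k expert routing in PyTorch; the class name and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NaiveTopKMoE(nn.Module):
    """Illustrative top-k MoE layer; not the optimized Gaudi implementation in this PR."""

    def __init__(self, hidden_size, ffn_size, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.SiLU(), nn.Linear(ffn_size, hidden_size))
                for _ in range(num_experts)
            ]
        )

    def forward(self, hidden_states):
        # hidden_states: [tokens, hidden_size]
        router_logits = self.gate(hidden_states)                      # [tokens, num_experts]
        routing_weights = F.softmax(router_logits, dim=-1)
        topk_weights, topk_experts = torch.topk(routing_weights, self.top_k, dim=-1)
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)

        output = torch.zeros_like(hidden_states)
        # Send each token through its selected experts and combine the weighted outputs
        for expert_id, expert in enumerate(self.experts):
            token_idx, slot_idx = torch.where(topk_experts == expert_id)
            if token_idx.numel() == 0:
                continue
            expert_out = expert(hidden_states[token_idx])
            output.index_add_(0, token_idx, expert_out * topk_weights[token_idx, slot_idx].unsqueeze(-1))
        return output, router_logits
```

The per-expert Python loop above is the kind of dynamic dispatch that MoE-specific optimizations typically target on accelerators such as Gaudi.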

Inference example:
```bash
python run_generation.py \
    --model_name_or_path <model_name_or_path> \
    --bf16 \
    --trim_logits \
    --batch_size 2 \
    --max_input_tokens 128 \
    --max_new_tokens 128 \
    --use_hpu_graphs \
    --use_kv_cache \
    --reuse_cache \
    --bucket_size 128 \
    --bucket_internal \
    --limit_hpu_graphs \
    --use_flash_attention
```

Training example:
```bash
python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py \
    --config_name deepseek_v2 \
    --tokenizer_name deepseek_v2 \
    --dataset_name alpaca \
    --block_size 4096 \
    --do_train \
    --num_train_epochs 1 \
    --max_steps 10 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --use_flash_attention True \
    --attn_softmax_bf16 False \
    --gradient_checkpointing \
    --learning_rate 2.4e-4 \
    --gaudi_config_name gaudi_config.json \
    --bf16 \
    --save_strategy no \
    --no_save_last_ckpt \
    --output_dir path_to_out \
    --overwrite_output_dir \
    --report_to tensorboard \
    --logging_strategy steps \
    --logging_dir path_to_log \
    --logging_steps 1 \
    --evaluation_strategy no \
    --use_habana \
    --use_lazy_mode \
    --deepspeed zero.json
```
(Here `deepseek_v2` stands for the model_name_or_path of your local copy, and `gaudi_config.json` is a Gaudi config such as Habana/gpt2.)

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@emascarenhas (Contributor):

@gyou2021,
There are style errors. Please run `make style` and fix the errors that are reported.

Do you need to add or modify tests?

I suggest running all of the text-generation slow tests with and without this change so that no new issues are introduced, and posting the results here:
GAUDI2_CI=1 RUN_SLOW=1 python -m pytest tests/test_text_generation_example.py

Is there any impact on the fine-tuning or training scripts?
Are any changes to the README required for the command line?

Edward

@xuguangxin:

@mengker33 @xhaihao @hlin99 please help review too.
Thank you.

@ranzhejiang (Contributor):

> @gyou2021, There are style errors. Please run `make style` and fix the errors that are reported.
>
> Do you need to add or modify tests?
>
> I suggest running all of the text-generation slow tests with and without this change so that no new issues are introduced, and posting the results here: GAUDI2_CI=1 RUN_SLOW=1 python -m pytest tests/test_text_generation_example.py
>
> Is there any impact on the fine-tuning or training scripts? Are any changes to the README required for the command line?
>
> Edward

Thank you for the review. For the training scripts, we only added some missing fused-kernel config arguments to run_clm.py, following the existing run_lora_clm.py.

@gyou2021 (Contributor, Author) commented Jan 8, 2025:

> @gyou2021, There are style errors. Please run `make style` and fix the errors that are reported.
> Do you need to add or modify tests?
> I suggest running all of the text-generation slow tests with and without this change so that no new issues are introduced, and posting the results here: GAUDI2_CI=1 RUN_SLOW=1 python -m pytest tests/test_text_generation_example.py
> Is there any impact on the fine-tuning or training scripts? Are any changes to the README required for the command line?
> Edward
>
> Thank you for the review. For the training scripts, we only added some missing fused-kernel config arguments to run_clm.py, following the existing run_lora_clm.py.

Thank you for your comments.
Fixed the style errors.
No, tests do not need to be added or modified for DeepSeek-v2 inference, since this PR is about its optimization.
Updated DeepSeek-v2 in the README.
The DeepSeek-V2-Lite test case passed the pytest.

@emascarenhas (Contributor):

Hi @gyou2021, can you provide a command line that you expect to work with DeepSeek-v2 and run_clm.py?

@gyou2021 (Contributor, Author) commented Jan 14, 2025:

> Hi @gyou2021, can you provide a command line that you expect to work with DeepSeek-v2 and run_clm.py?

```bash
python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py \
    --config_name deepseek_v2 \
    --tokenizer_name deepseek_v2 \
    --dataset_name alpaca \
    --block_size 4096 \
    --do_train \
    --num_train_epochs 1 \
    --max_steps 10 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --use_flash_attention True \
    --attn_softmax_bf16 False \
    --gradient_checkpointing \
    --learning_rate 2.4e-4 \
    --gaudi_config_name gaudi_config.json \
    --bf16 \
    --save_strategy no \
    --no_save_last_ckpt \
    --output_dir path_to_out \
    --overwrite_output_dir \
    --report_to tensorboard \
    --logging_strategy steps \
    --logging_dir path_to_log \
    --logging_steps 1 \
    --evaluation_strategy no \
    --use_habana \
    --use_lazy_mode \
    --deepspeed zero.json
```
(Here `deepseek_v2` stands for the model_name_or_path of your local copy, and `gaudi_config.json` is a Gaudi config such as Habana/gpt2.)

@emascarenhas (Contributor):

I could not run the command as provided since the config file, etc., weren't local. Here is a command that works and pulls the model from the Hugging Face Hub. Similar commands using other Hugging Face DeepSeek models should also work.

```bash
python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py \
    --config_name deepseek-ai/DeepSeek-V2-Lite \
    --tokenizer_name deepseek-ai/DeepSeek-V2-Lite \
    --dataset_name tatsu-lab/alpaca \
    --block_size 4096 \
    --do_train \
    --num_train_epochs 1 \
    --max_steps 10 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --use_flash_attention True \
    --attn_softmax_bf16 False \
    --learning_rate 2.4e-4 \
    --gaudi_config_name Habana/gpt2 \
    --bf16 \
    --save_strategy no \
    --no_save_last_ckpt \
    --output_dir path_to_out \
    --overwrite_output_dir \
    --report_to tensorboard \
    --logging_strategy steps \
    --logging_dir path_to_log \
    --logging_steps 1 \
    --evaluation_strategy no \
    --use_habana \
    --use_lazy_mode
```

@ranzhejiang (Contributor):

> I could not run the command as provided since the config file, etc., weren't local. Here is a command that works and pulls the model from the Hugging Face Hub. Similar commands using other Hugging Face DeepSeek models should also work.
>
> python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py --config_name deepseek-ai/DeepSeek-V2-Lite --tokenizer_name deepseek-ai/DeepSeek-V2-Lite --dataset_name tatsu-lab/alpaca --block_size 4096 --do_train --num_train_epochs 1 --max_steps 10 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --use_flash_attention True --attn_softmax_bf16 False --learning_rate 2.4e-4 --gaudi_config_name Habana/gpt2 --bf16 --save_strategy no --no_save_last_ckpt --output_dir path_to_out --overwrite_output_dir --report_to tensorboard --logging_strategy steps --logging_dir path_to_log --logging_steps 1 --evaluation_strategy no --use_habana --use_lazy_mode

Thank you for testing; it works with your command now. If you hit OOM, you can add --deepspeed zero3.json or --gradient_checkpointing to reduce memory usage (see the config sketch below).
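For reference, here is a minimal sketch of what a ZeRO-3 DeepSpeed config such as the zero3.json mentioned above might contain; the exact file used in this thread is not shown, so all field values below are assumptions to adapt to your setup:

```python
import json

# Hypothetical minimal ZeRO-3 config; the actual zero3.json referenced above is not
# shown in this thread, so these values are illustrative assumptions.
zero3_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

with open("zero3.json", "w") as f:
    json.dump(zero3_config, f, indent=2)
```

The "auto" values are resolved by the Hugging Face Trainer DeepSpeed integration at launch time; pass the resulting file via `--deepspeed zero3.json`.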

@emascarenhas (Contributor):

I was able to run training successfully with the additional options suggested.
The command appears to hang if the DeepSpeed config is not passed; it should have errored out with OOM instead.
Here is a command that completes successfully:
```bash
python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py \
    --config_name deepseek-ai/DeepSeek-V2-Lite \
    --tokenizer_name deepseek-ai/DeepSeek-V2-Lite \
    --dataset_name tatsu-lab/alpaca \
    --block_size 4096 \
    --do_train \
    --num_train_epochs 1 \
    --max_steps 10 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --use_flash_attention True \
    --attn_softmax_bf16 False \
    --gradient_checkpointing \
    --learning_rate 2.4e-4 \
    --gaudi_config_name Habana/gpt2 \
    --bf16 \
    --save_strategy no \
    --no_save_last_ckpt \
    --output_dir /root/deepseek-v2-lite \
    --overwrite_output_dir \
    --logging_strategy steps \
    --logging_dir /root/deepseek-v2-lite/log \
    --logging_steps 1 \
    --evaluation_strategy no \
    --use_habana \
    --use_lazy_mode \
    --deepspeed llama2_ds_zero3_config.json
```

@gyou2021 (Contributor, Author):

We added the training command you verified to the language-modeling README.md.

> I was able to run training successfully with the additional options suggested. The command appears to hang if the DeepSpeed config is not passed; it should have errored out with OOM instead. Here is a command that completes successfully: python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py --config_name deepseek-ai/DeepSeek-V2-Lite --tokenizer_name deepseek-ai/DeepSeek-V2-Lite --dataset_name tatsu-lab/alpaca --block_size 4096 --do_train --num_train_epochs 1 --max_steps 10 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --use_flash_attention True --attn_softmax_bf16 False --gradient_checkpointing --learning_rate 2.4e-4 --gaudi_config_name Habana/gpt2 --bf16 --save_strategy no --no_save_last_ckpt --output_dir /root/deepseek-v2-lite --overwrite_output_dir --logging_strategy steps --logging_dir /root/deepseek-v2-lite/log --logging_steps 1 --evaluation_strategy no --use_habana --use_lazy_mode --deepspeed llama2_ds_zero3_config.json

@emascarenhas (Contributor):

@libinta, looks good to me to proceed to run-test and merge approval.

@libinta added the run-test label (Run CI for PRs from external contributors) on Jan 22, 2025.
@regisss (Collaborator) left a comment:

I left a few comments.
Also, have you checked that this test still passes?

README.md (resolved review thread)
Comment on lines +187 to +216
### Multi-card Training with Deepspeed (DeepSeek-V2-Lite)
```bash
python ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py \
    --config_name deepseek-ai/DeepSeek-V2-Lite \
    --tokenizer_name deepseek-ai/DeepSeek-V2-Lite \
    --dataset_name tatsu-lab/alpaca \
    --block_size 4096 \
    --do_train \
    --num_train_epochs 1 \
    --max_steps 10 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --use_flash_attention True \
    --attn_softmax_bf16 False \
    --gradient_checkpointing \
    --learning_rate 2.4e-4 \
    --gaudi_config_name Habana/gpt2 \
    --bf16 \
    --save_strategy no \
    --no_save_last_ckpt \
    --output_dir /root/deepseek-v2-lite \
    --overwrite_output_dir \
    --logging_strategy steps \
    --logging_dir /root/deepseek-v2-lite/log \
    --logging_steps 1 \
    --evaluation_strategy no \
    --use_habana \
    --use_lazy_mode \
    --deepspeed llama2_ds_zero3_config.json
```
Collaborator:

Is there any arg here that is specific to this model? Otherwise, let's try to keep the README concise.

@gyou2021 (Contributor, Author) commented Jan 28, 2025:

Certain arguments, such as --use_flash_attention, enable optimization during training. Some of these argument values were derived from our experiments. The command has been verified on both the local server and DevCloud. If any arguments are missing, such as --deepspeed llama2_ds_zero3_config.json, the training process may fail. To ensure successful execution, we have included these details in the README.

Comment on lines +122 to +143
```python
    Computes auxiliary load balancing loss as in Switch Transformer - implemented in Pytorch.

    See Switch Transformer (https://arxiv.org/abs/2101.03961) for more details. This function implements the loss
    function presented in equations (4) - (6) of the paper. It aims at penalizing cases where the routing between
    experts is too unbalanced.

    Args:
        gate_logits:
            Logits from the `gate`, should be a tuple of model.config.num_hidden_layers tensors of
            shape [batch_size X sequence_length, num_experts].
        num_experts:
            Number of experts
        top_k:
            The number of experts to route per-token, can be also interpreted as the `top-k` routing
            parameter.
        attention_mask (`torch.Tensor`, *optional*):
            The attention_mask used in forward function
            shape [batch_size X sequence_length] if not None.

    Returns:
        The auxiliary loss.
    """
```
Collaborator:

Can you explain how this method differs from the original?

@gyou2021 (Contributor, Author):

The auxiliary load balancing loss here is a mechanism designed to ensure that the computational load is evenly distributed among the available experts. This loss function penalizes scenarios where the routing of tokens to experts is highly imbalanced, encouraging a more uniform distribution of tokens across the experts.
The goal is to prevent any single expert from being overloaded while others remain underutilized. This helps in maintaining efficiency and stability during training and inference.
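To make the mechanism concrete, here is a simplified sketch of this auxiliary loss (equations (4)-(6) of the Switch Transformer paper) computed from router logits; it follows the docstring quoted above but omits the optional attention_mask handling and is not the exact function merged in this PR:

```python
import torch
import torch.nn.functional as F


def load_balancing_loss_sketch(gate_logits, num_experts, top_k=2):
    """Simplified auxiliary load-balancing loss; illustrative only, attention_mask omitted.

    gate_logits: tuple of per-layer router logits, each of shape [batch * seq_len, num_experts].
    """
    # Concatenate router logits from every layer: [num_layers * tokens, num_experts]
    logits = torch.cat(gate_logits, dim=0)

    routing_weights = F.softmax(logits, dim=-1)
    _, selected_experts = torch.topk(routing_weights, top_k, dim=-1)

    # One-hot mask of the experts each token was routed to: [tokens, top_k, num_experts]
    expert_mask = F.one_hot(selected_experts, num_experts).float()

    # Fraction of tokens dispatched to each expert (per top-k slot)
    tokens_per_expert = expert_mask.mean(dim=0)              # [top_k, num_experts]
    # Average router probability assigned to each expert
    router_prob_per_expert = routing_weights.mean(dim=0)     # [num_experts]

    # The loss is large when dispatch fractions and router probabilities concentrate
    # on the same few experts; it is minimized by a uniform routing distribution.
    return num_experts * torch.sum(tokens_per_expert * router_prob_per_expert.unsqueeze(0))
```

During training this scalar is typically scaled by a coefficient (e.g., the model config's router_aux_loss_coef) and added to the language-modeling loss.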

@gyou2021 (Contributor, Author):

> this test still passes?

Yes, our tests verified that this DeepSeek-V2-Lite test still passes.

@regisss (Collaborator) left a comment:

You'll have to run `make style` to make the style check pass.

@regisss (Collaborator) commented Jan 28, 2025:

@gyou2021 There is also a merge conflict to solve

@gyou2021 (Contributor, Author) commented Jan 28, 2025:

> You'll have to run `make style` to make the style check pass.

It has been run and style problems were fixed. The style check passed.

@gyou2021 (Contributor, Author):

> @gyou2021 There is also a merge conflict to solve

Solved. Thank you.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@regisss regisss merged commit f50938c into huggingface:main Jan 28, 2025
4 checks passed