Optimized DeepSeek-v2 on Gaudi #1677
Conversation
@gyou2021, do you need to add or modify tests? I suggest running all the text-generation slow tests with and without this change so that no new issues are introduced by it. Please post the results here. Is there any impact on the fine-tuning or training scripts? Edward
@mengker33 @xhaihao @hlin99 please help review too.
Thank you for the review. For the training scripts, we only added some missing fused-kernel config arguments to run_clm.py, following the already existing run_lora_clm.py.
Thank you for your comments.
Hi @gyou2021, can you provide a command line you expect to work with DeepSeek-v2 and run_clm.py?
python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py
I could not run the command provided as-is since the config file, etc. weren't local. Here is a command that works and pulls the model from HF. Similar commands using other Hugging Face DeepSeek models should also work: python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py
Thank you for testing; it now works with your command. If you hit OOM, you can add --deepspeed zero3.json or --gradient_checkpointing to reduce memory.
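For reference, a minimal sketch of what a `zero3.json` could contain is shown below. The exact file is not part of this thread, so the values are assumptions based on the standard DeepSpeed config schema (ZeRO stage 3, bf16, "auto" batch settings for the Hugging Face Trainer integration), and the small script that writes it is purely illustrative.

```python
# Illustrative sketch only: writes a minimal ZeRO stage-3 DeepSpeed config.
# Field names follow the standard DeepSpeed schema; the values are assumptions,
# not the exact config used in this PR.
import json

zero3_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "contiguous_gradients": True,
        "overlap_comm": False,
    },
    # "auto" lets the Hugging Face Trainer fill these in from its own arguments.
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}

with open("zero3.json", "w") as f:
    json.dump(zero3_config, f, indent=2)
```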
I was able to run training successfully with the additional options suggested.
We added the training command you verified into the README.txt of the language-modeling examples.
@libinta, looks good to me to proceed to run-test & merge approval.
I left a few comments.
Also, have you checked that this test still passes?
### Multi-card Training with Deepspeed (DeepSeek-V2-Lite)
```bash
python ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py \
    --config_name deepseek-ai/DeepSeek-V2-Lite \
    --tokenizer_name deepseek-ai/DeepSeek-V2-Lite \
    --dataset_name tatsu-lab/alpaca \
    --block_size 4096 \
    --do_train \
    --num_train_epochs 1 \
    --max_steps 10 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --use_flash_attention True \
    --attn_softmax_bf16 False \
    --gradient_checkpointing \
    --learning_rate 2.4e-4 \
    --gaudi_config_name Habana/gpt2 \
    --bf16 \
    --save_strategy no \
    --no_save_last_ckpt \
    --output_dir /root/deepseek-v2-lite \
    --overwrite_output_dir \
    --logging_strategy steps \
    --logging_dir /root/deepseek-v2-lite/log \
    --logging_steps 1 \
    --evaluation_strategy no \
    --use_habana \
    --use_lazy_mode \
    --deepspeed llama2_ds_zero3_config.json
```
Is there any specific arg here that is specific to this model? Otherwise, let's just try to keep the README concise.
Certain arguments, such as --use_flash_attention, enable optimization during training. Some of these argument values were derived from our experiments. The command has been verified on both the local server and DevCloud. If any arguments are missing, such as --deepspeed llama2_ds_zero3_config.json, the training process may fail. To ensure successful execution, we have included these details in the README.
optimum/habana/transformers/models/deepseek_v2/modeling_deepseek_v2.py
Computes auxiliary load balancing loss as in Switch Transformer - implemented in PyTorch.

See Switch Transformer (https://arxiv.org/abs/2101.03961) for more details. This function implements the loss
function presented in equations (4) - (6) of the paper. It aims at penalizing cases where the routing between
experts is too unbalanced.

Args:
    gate_logits:
        Logits from the `gate`, should be a tuple of model.config.num_hidden_layers tensors of
        shape [batch_size X sequence_length, num_experts].
    num_experts:
        Number of experts.
    top_k:
        The number of experts to route per token; can also be interpreted as the `top-k` routing
        parameter.
    attention_mask (`torch.Tensor`, *optional*):
        The attention_mask used in the forward function,
        of shape [batch_size X sequence_length] if not None.

Returns:
    The auxiliary loss.
"""
Can you explain how this method differs from the original?
The auxiliary load balancing loss here is a mechanism designed to ensure that the computational load is evenly distributed among the available experts. This loss function penalizes scenarios where the routing of tokens to experts is highly imbalanced, encouraging a more uniform distribution of tokens across the experts.
The goal is to prevent any single expert from being overloaded while others remain underutilized. This helps in maintaining efficiency and stability during training and inference.
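To make the mechanism concrete, here is a minimal sketch of how such a loss can be computed from the per-layer router logits described in the docstring above. This is illustrative only, ignores the optional attention_mask, and is not claimed to be the exact code in this PR; the function name `load_balancing_loss_sketch` is hypothetical.

```python
import torch
import torch.nn.functional as F


def load_balancing_loss_sketch(gate_logits, num_experts, top_k=2):
    # gate_logits: tuple of [batch_size * seq_len, num_experts] tensors, one per layer
    concatenated = torch.cat(list(gate_logits), dim=0)
    routing_weights = F.softmax(concatenated, dim=-1)
    _, selected_experts = torch.topk(routing_weights, top_k, dim=-1)
    # One-hot mask of the experts each token was routed to: [tokens, top_k, num_experts]
    expert_mask = F.one_hot(selected_experts, num_experts)
    # Fraction of tokens dispatched to each expert (eq. 5 of the paper)
    tokens_per_expert = torch.mean(expert_mask.float(), dim=0)
    # Mean router probability assigned to each expert (eq. 6)
    router_prob_per_expert = torch.mean(routing_weights, dim=0)
    # Eq. 4: the scaled dot product is smallest when both distributions are uniform,
    # so minimizing it pushes the router toward a balanced expert assignment.
    return num_experts * torch.sum(tokens_per_expert * router_prob_per_expert.unsqueeze(0))


# Toy usage: 3 layers, 16 tokens, 8 experts
logits = tuple(torch.randn(16, 8) for _ in range(3))
aux_loss = load_balancing_loss_sketch(logits, num_experts=8, top_k=2)
```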
You'll have to run `make style` to make the style check pass.
@gyou2021 There is also a merge conflict to solve.
`make style` has been run and the style problems were fixed; the style check now passes.
Solved. Thank you.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Inference example:

python run_generation.py \
    --model_name_or_path \
    --bf16 \
    --trim_logits \
    --batch_size 2 \
    --max_input_tokens 128 \
    --max_new_tokens 128 \
    --use_hpu_graphs \
    --use_kv_cache \
    --reuse_cache \
    --bucket_size 128 \
    --bucket_internal \
    --limit_hpu_graphs \
    --use_flash_attention
Training example:

python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py \
    --config_name deepseek_v2 (model_name_or_path for your local) \
    --tokenizer_name deepseek_v2 (model_name_or_path for your local) \
    --dataset_name alpaca \
    --block_size 4096 \
    --do_train \
    --num_train_epochs 1 \
    --max_steps 10 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --use_flash_attention True \
    --attn_softmax_bf16 False \
    --gradient_checkpointing \
    --learning_rate 2.4e-4 \
    --gaudi_config_name gaudi_config.json (just like Habana/gpt2) \
    --bf16 \
    --save_strategy no \
    --no_save_last_ckpt \
    --output_dir path_to_out \
    --overwrite_output_dir \
    --report_to tensorboard \
    --logging_strategy steps \
    --logging_dir path_to_log \
    --logging_steps 1 \
    --evaluation_strategy no \
    --use_habana \
    --use_lazy_mode \
    --deepspeed zero.json
Fixes # (issue)
Before submitting