
Optimized DeepSeek-v2 on Gaudi #1677
Merged (16 commits into huggingface:main on Jan 28, 2025)

Conversation

@gyou2021 (Contributor) commented Jan 6, 2025:

  1. Optimized MLA attention for inference;
  2. Optimized MoE with dynamic MoE and expert slicing for inference (a background sketch of top-k routing is included below);
  3. Optimized the cache for inference;
  4. Optimized training.
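As background for the MoE item above (and not the dynamic-MoE/expert-slicing implementation this PR adds on Gaudi), here is a minimal sketch of plain top-k expert routing in PyTorch; the class name and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NaiveTopKMoE(nn.Module):
    """Illustrative top-k MoE layer; not the optimized Gaudi implementation in this PR."""

    def __init__(self, hidden_size, ffn_size, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.SiLU(), nn.Linear(ffn_size, hidden_size))
                for _ in range(num_experts)
            ]
        )

    def forward(self, hidden_states):
        # hidden_states: [tokens, hidden_size]
        router_logits = self.gate(hidden_states)                      # [tokens, num_experts]
        routing_weights = F.softmax(router_logits, dim=-1)
        topk_weights, topk_experts = torch.topk(routing_weights, self.top_k, dim=-1)
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)

        output = torch.zeros_like(hidden_states)
        # Send each token through its selected experts and combine the weighted outputs
        for expert_id, expert in enumerate(self.experts):
            token_idx, slot_idx = torch.where(topk_experts == expert_id)
            if token_idx.numel() == 0:
                continue
            expert_out = expert(hidden_states[token_idx])
            output.index_add_(0, token_idx, expert_out * topk_weights[token_idx, slot_idx].unsqueeze(-1))
        return output, router_logits
```

The per-expert Python loop above is the kind of dynamic dispatch that MoE-specific optimizations typically target on accelerators such as Gaudi.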

Inference example:
```bash
python run_generation.py \
    --model_name_or_path <model_name_or_path> \
    --bf16 \
    --trim_logits \
    --batch_size 2 \
    --max_input_tokens 128 \
    --max_new_tokens 128 \
    --use_hpu_graphs \
    --use_kv_cache \
    --reuse_cache \
    --bucket_size 128 \
    --bucket_internal \
    --limit_hpu_graphs \
    --use_flash_attention
```

Training example:
```bash
python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py \
    --config_name deepseek_v2 \
    --tokenizer_name deepseek_v2 \
    --dataset_name alpaca \
    --block_size 4096 \
    --do_train \
    --num_train_epochs 1 \
    --max_steps 10 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --use_flash_attention True \
    --attn_softmax_bf16 False \
    --gradient_checkpointing \
    --learning_rate 2.4e-4 \
    --gaudi_config_name gaudi_config.json \
    --bf16 \
    --save_strategy no \
    --no_save_last_ckpt \
    --output_dir path_to_out \
    --overwrite_output_dir \
    --report_to tensorboard \
    --logging_strategy steps \
    --logging_dir path_to_log \
    --logging_steps 1 \
    --evaluation_strategy no \
    --use_habana \
    --use_lazy_mode \
    --deepspeed zero.json
```
(Here `deepseek_v2` stands for the model_name_or_path of your local copy, and `gaudi_config.json` is a Gaudi config such as Habana/gpt2.)

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@emascarenhas (Contributor):

@gyou2021,
There are style errors. Please run `make style` and fix the errors that are reported.

Do you need to add or modify tests?

I suggest running all of the text-generation slow tests with and without this change so that no new issues are introduced, and posting the results here:
GAUDI2_CI=1 RUN_SLOW=1 python -m pytest tests/test_text_generation_example.py

Is there any impact on the fine-tuning or training scripts?
Are any changes to the README required for the command line?

Edward

@xuguangxin:

@mengker33 @xhaihao @hlin99 please help review too.
Thank you.

@ranzhejiang (Contributor):

> @gyou2021, There are style errors. Please run `make style` and fix the errors that are reported.
>
> Do you need to add or modify tests?
>
> I suggest running all of the text-generation slow tests with and without this change so that no new issues are introduced, and posting the results here: GAUDI2_CI=1 RUN_SLOW=1 python -m pytest tests/test_text_generation_example.py
>
> Is there any impact on the fine-tuning or training scripts? Are any changes to the README required for the command line?
>
> Edward

Thank you for the review. For the training scripts, we only added some missing fused-kernel config arguments to run_clm.py, following the existing run_lora_clm.py.

@gyou2021 (Contributor, Author) commented Jan 8, 2025:

> @gyou2021, There are style errors. Please run `make style` and fix the errors that are reported.
> Do you need to add or modify tests?
> I suggest running all of the text-generation slow tests with and without this change so that no new issues are introduced, and posting the results here: GAUDI2_CI=1 RUN_SLOW=1 python -m pytest tests/test_text_generation_example.py
> Is there any impact on the fine-tuning or training scripts? Are any changes to the README required for the command line?
> Edward
>
> Thank you for the review. For the training scripts, we only added some missing fused-kernel config arguments to run_clm.py, following the existing run_lora_clm.py.

Thank you for your comments.
Fixed the style errors.
No, tests do not need to be added or modified for DeepSeek-v2 inference, since this PR is about its optimization.
Updated DeepSeek-v2 in the README.
The DeepSeek-V2-Lite test case passed the pytest.

@emascarenhas (Contributor):

Hi @gyou2021, can you provide a command line that you expect to work with DeepSeek-v2 and run_clm.py?

@gyou2021 (Contributor, Author) commented Jan 14, 2025:

> Hi @gyou2021, can you provide a command line that you expect to work with DeepSeek-v2 and run_clm.py?

```bash
python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py \
    --config_name deepseek_v2 \
    --tokenizer_name deepseek_v2 \
    --dataset_name alpaca \
    --block_size 4096 \
    --do_train \
    --num_train_epochs 1 \
    --max_steps 10 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --use_flash_attention True \
    --attn_softmax_bf16 False \
    --gradient_checkpointing \
    --learning_rate 2.4e-4 \
    --gaudi_config_name gaudi_config.json \
    --bf16 \
    --save_strategy no \
    --no_save_last_ckpt \
    --output_dir path_to_out \
    --overwrite_output_dir \
    --report_to tensorboard \
    --logging_strategy steps \
    --logging_dir path_to_log \
    --logging_steps 1 \
    --evaluation_strategy no \
    --use_habana \
    --use_lazy_mode \
    --deepspeed zero.json
```
(Here `deepseek_v2` stands for the model_name_or_path of your local copy, and `gaudi_config.json` is a Gaudi config such as Habana/gpt2.)

@emascarenhas (Contributor):

I could not run the command as provided since the config file, etc., weren't local. Here is a command that works and pulls the model from the Hugging Face Hub. Similar commands using other Hugging Face DeepSeek models should also work.

```bash
python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py \
    --config_name deepseek-ai/DeepSeek-V2-Lite \
    --tokenizer_name deepseek-ai/DeepSeek-V2-Lite \
    --dataset_name tatsu-lab/alpaca \
    --block_size 4096 \
    --do_train \
    --num_train_epochs 1 \
    --max_steps 10 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --use_flash_attention True \
    --attn_softmax_bf16 False \
    --learning_rate 2.4e-4 \
    --gaudi_config_name Habana/gpt2 \
    --bf16 \
    --save_strategy no \
    --no_save_last_ckpt \
    --output_dir path_to_out \
    --overwrite_output_dir \
    --report_to tensorboard \
    --logging_strategy steps \
    --logging_dir path_to_log \
    --logging_steps 1 \
    --evaluation_strategy no \
    --use_habana \
    --use_lazy_mode
```

@ranzhejiang (Contributor):

> I could not run the command as provided since the config file, etc., weren't local. Here is a command that works and pulls the model from the Hugging Face Hub. Similar commands using other Hugging Face DeepSeek models should also work.
>
> python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py --config_name deepseek-ai/DeepSeek-V2-Lite --tokenizer_name deepseek-ai/DeepSeek-V2-Lite --dataset_name tatsu-lab/alpaca --block_size 4096 --do_train --num_train_epochs 1 --max_steps 10 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --use_flash_attention True --attn_softmax_bf16 False --learning_rate 2.4e-4 --gaudi_config_name Habana/gpt2 --bf16 --save_strategy no --no_save_last_ckpt --output_dir path_to_out --overwrite_output_dir --report_to tensorboard --logging_strategy steps --logging_dir path_to_log --logging_steps 1 --evaluation_strategy no --use_habana --use_lazy_mode

Thank you for testing; it works with your command now. If you hit OOM, you can add --deepspeed zero3.json or --gradient_checkpointing to reduce memory usage (see the config sketch below).
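For reference, here is a minimal sketch of what a ZeRO-3 DeepSpeed config such as the zero3.json mentioned above might contain; the exact file used in this thread is not shown, so all field values below are assumptions to adapt to your setup:

```python
import json

# Hypothetical minimal ZeRO-3 config; the actual zero3.json referenced above is not
# shown in this thread, so these values are illustrative assumptions.
zero3_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

with open("zero3.json", "w") as f:
    json.dump(zero3_config, f, indent=2)
```

The "auto" values are resolved by the Hugging Face Trainer DeepSpeed integration at launch time; pass the resulting file via `--deepspeed zero3.json`.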

@emascarenhas (Contributor):

I was able to run training successfully with the additional options suggested.
The command appears to hang if the DeepSpeed config is not passed; it should have errored out with OOM instead.
Here is a command that completes successfully:
```bash
python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py \
    --config_name deepseek-ai/DeepSeek-V2-Lite \
    --tokenizer_name deepseek-ai/DeepSeek-V2-Lite \
    --dataset_name tatsu-lab/alpaca \
    --block_size 4096 \
    --do_train \
    --num_train_epochs 1 \
    --max_steps 10 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --use_flash_attention True \
    --attn_softmax_bf16 False \
    --gradient_checkpointing \
    --learning_rate 2.4e-4 \
    --gaudi_config_name Habana/gpt2 \
    --bf16 \
    --save_strategy no \
    --no_save_last_ckpt \
    --output_dir /root/deepseek-v2-lite \
    --overwrite_output_dir \
    --logging_strategy steps \
    --logging_dir /root/deepseek-v2-lite/log \
    --logging_steps 1 \
    --evaluation_strategy no \
    --use_habana \
    --use_lazy_mode \
    --deepspeed llama2_ds_zero3_config.json
```

@gyou2021 (Contributor, Author):

We added the training command you verified to the language-modeling README.md.

> I was able to run training successfully with the additional options suggested. The command appears to hang if the DeepSpeed config is not passed; it should have errored out with OOM instead. Here is a command that completes successfully: python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py --config_name deepseek-ai/DeepSeek-V2-Lite --tokenizer_name deepseek-ai/DeepSeek-V2-Lite --dataset_name tatsu-lab/alpaca --block_size 4096 --do_train --num_train_epochs 1 --max_steps 10 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --use_flash_attention True --attn_softmax_bf16 False --gradient_checkpointing --learning_rate 2.4e-4 --gaudi_config_name Habana/gpt2 --bf16 --save_strategy no --no_save_last_ckpt --output_dir /root/deepseek-v2-lite --overwrite_output_dir --logging_strategy steps --logging_dir /root/deepseek-v2-lite/log --logging_steps 1 --evaluation_strategy no --use_habana --use_lazy_mode --deepspeed llama2_ds_zero3_config.json

@emascarenhas (Contributor):

@libinta, looks good to me to proceed to run-test and merge approval.

@libinta added the run-test label (Run CI for PRs from external contributors) on Jan 22, 2025.
@regisss (Collaborator) left a comment:

I left a few comments.
Also, have you checked that this test still passes?

README.md (resolved review thread)
Comment on lines +187 to +216
### Multi-card Training with Deepspeed (DeepSeek-V2-Lite)
```bash
python ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py \
    --config_name deepseek-ai/DeepSeek-V2-Lite \
    --tokenizer_name deepseek-ai/DeepSeek-V2-Lite \
    --dataset_name tatsu-lab/alpaca \
    --block_size 4096 \
    --do_train \
    --num_train_epochs 1 \
    --max_steps 10 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --use_flash_attention True \
    --attn_softmax_bf16 False \
    --gradient_checkpointing \
    --learning_rate 2.4e-4 \
    --gaudi_config_name Habana/gpt2 \
    --bf16 \
    --save_strategy no \
    --no_save_last_ckpt \
    --output_dir /root/deepseek-v2-lite \
    --overwrite_output_dir \
    --logging_strategy steps \
    --logging_dir /root/deepseek-v2-lite/log \
    --logging_steps 1 \
    --evaluation_strategy no \
    --use_habana \
    --use_lazy_mode \
    --deepspeed llama2_ds_zero3_config.json
```
Collaborator:

Is there any arg here that is specific to this model? Otherwise, let's try to keep the README concise.

@gyou2021 (Contributor, Author) commented Jan 28, 2025:

Certain arguments, such as --use_flash_attention, enable optimization during training. Some of these argument values were derived from our experiments. The command has been verified on both the local server and DevCloud. If any arguments are missing, such as --deepspeed llama2_ds_zero3_config.json, the training process may fail. To ensure successful execution, we have included these details in the README.

Comment on lines +122 to +143
```python
    Computes auxiliary load balancing loss as in Switch Transformer - implemented in Pytorch.

    See Switch Transformer (https://arxiv.org/abs/2101.03961) for more details. This function implements the loss
    function presented in equations (4) - (6) of the paper. It aims at penalizing cases where the routing between
    experts is too unbalanced.

    Args:
        gate_logits:
            Logits from the `gate`, should be a tuple of model.config.num_hidden_layers tensors of
            shape [batch_size X sequence_length, num_experts].
        num_experts:
            Number of experts
        top_k:
            The number of experts to route per-token, can be also interpreted as the `top-k` routing
            parameter.
        attention_mask (`torch.Tensor`, *optional*):
            The attention_mask used in forward function
            shape [batch_size X sequence_length] if not None.

    Returns:
        The auxiliary loss.
    """
```
Collaborator:

Can you explain how this method differs from the original?

@gyou2021 (Contributor, Author):

The auxiliary load balancing loss here is a mechanism designed to ensure that the computational load is evenly distributed among the available experts. This loss function penalizes scenarios where the routing of tokens to experts is highly imbalanced, encouraging a more uniform distribution of tokens across the experts.
The goal is to prevent any single expert from being overloaded while others remain underutilized. This helps in maintaining efficiency and stability during training and inference.
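To make the mechanism concrete, here is a simplified sketch of this auxiliary loss (equations (4)-(6) of the Switch Transformer paper) computed from router logits; it follows the docstring quoted above but omits the optional attention_mask handling and is not the exact function merged in this PR:

```python
import torch
import torch.nn.functional as F


def load_balancing_loss_sketch(gate_logits, num_experts, top_k=2):
    """Simplified auxiliary load-balancing loss; illustrative only, attention_mask omitted.

    gate_logits: tuple of per-layer router logits, each of shape [batch * seq_len, num_experts].
    """
    # Concatenate router logits from every layer: [num_layers * tokens, num_experts]
    logits = torch.cat(gate_logits, dim=0)

    routing_weights = F.softmax(logits, dim=-1)
    _, selected_experts = torch.topk(routing_weights, top_k, dim=-1)

    # One-hot mask of the experts each token was routed to: [tokens, top_k, num_experts]
    expert_mask = F.one_hot(selected_experts, num_experts).float()

    # Fraction of tokens dispatched to each expert (per top-k slot)
    tokens_per_expert = expert_mask.mean(dim=0)              # [top_k, num_experts]
    # Average router probability assigned to each expert
    router_prob_per_expert = routing_weights.mean(dim=0)     # [num_experts]

    # The loss is large when dispatch fractions and router probabilities concentrate
    # on the same few experts; it is minimized by a uniform routing distribution.
    return num_experts * torch.sum(tokens_per_expert * router_prob_per_expert.unsqueeze(0))
```

During training this scalar is typically scaled by a coefficient (e.g., the model config's router_aux_loss_coef) and added to the language-modeling loss.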

@gyou2021 (Contributor, Author):

> this test still passes?

Yes, our tests verified that this DeepSeek-V2-Lite test still passes.

@regisss (Collaborator) left a comment:

You'll have to run `make style` to make the style check pass.

@regisss (Collaborator) commented Jan 28, 2025:

@gyou2021 There is also a merge conflict to solve

@gyou2021 (Contributor, Author) commented Jan 28, 2025:

> You'll have to run `make style` to make the style check pass.

It has been run and style problems were fixed. The style check passed.

@gyou2021 (Contributor, Author):

> @gyou2021 There is also a merge conflict to solve

Solved. Thank you.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@regisss regisss merged commit f50938c into huggingface:main Jan 28, 2025
4 checks passed