
Optimized DeepSeek-v2 on Gaudi #1677

Merged · 16 commits · Jan 28, 2025
Changes from all commits

README.md · 2 changes: 1 addition & 1 deletion
@@ -279,7 +279,7 @@ The following model architectures, tasks and device distributions have been validated
| Mllama | <li>LoRA</li> | :heavy_check_mark: | <li>[image to text](https://github.com/huggingface/optimum-habana/tree/main/examples/image-to-text)</li> |
| MiniCPM3 | | <li>Single card</li> | <li>[text generation](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation)</li> |
| Baichuan2 | <li>DeepSpeed</li> | <li>Single card</li> | <li>[language modeling](https://github.com/huggingface/optimum-habana/tree/main/examples/language-modeling)</li><li>[text generation](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation)</li> |
| DeepSeek-V2 | | :heavy_check_mark: | <li>[text generation](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation)</li> |
| DeepSeek-V2 | :heavy_check_mark: | :heavy_check_mark: | <li>[text generation](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation)</li> |
| ChatGLM | <li>DeepSpeed</li> | <li>Single card</li> | <li>[language modeling](https://github.com/huggingface/optimum-habana/tree/main/examples/language-modeling)</li><li>[text generation](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation)</li> |


docs/source/index.mdx · 2 changes: 1 addition & 1 deletion
@@ -107,7 +107,7 @@ In the tables below, ✅ means single-card, multi-card and DeepSpeed have all been validated
| Mllama | <div style="text-align:left"><li>LoRA</li></div> |✅ | <li>[image to text](https://github.com/huggingface/optimum-habana/tree/main/examples/image-to-text)</li> |
| MiniCPM3 | | <div style="text-align:left"><li>Single card</li></div> | <li>[text generation](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation)</li> |
| Baichuan2 | <div style="text-align:left"><li>DeepSpeed</li></div> | <div style="text-align:left"><li>Single card</li></div> | <li>[language modeling](https://github.com/huggingface/optimum-habana/tree/main/examples/language-modeling)</li><li>[text generation](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation)</li> |
| DeepSeek-V2 | | ✅ | <li>[text generation](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation)</li> |
| DeepSeek-V2 | ✅ | ✅ | <li>[text generation](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation)</li> |
| ChatGLM | <div style="text-align:left"><li>DeepSpeed</li></div> | <div style="text-align:left"><li>Single card</li></div> | <li>[language modeling](https://github.com/huggingface/optimum-habana/tree/main/examples/language-modeling)</li><li>[text generation](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation)</li> |

- Diffusers
examples/language-modeling/README.md · 30 changes: 30 additions & 0 deletions
@@ -184,6 +184,36 @@ python ../gaudi_spawn.py \
--logging_steps 20
```

### Multi-card Training with Deepspeed (DeepSeek-V2-Lite)
```bash
python ../gaudi_spawn.py --world_size 8 --use_deepspeed run_clm.py \
    --config_name deepseek-ai/DeepSeek-V2-Lite \
    --tokenizer_name deepseek-ai/DeepSeek-V2-Lite \
    --dataset_name tatsu-lab/alpaca \
    --block_size 4096 \
    --do_train \
    --num_train_epochs 1 \
    --max_steps 10 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --use_flash_attention True \
    --attn_softmax_bf16 False \
    --gradient_checkpointing \
    --learning_rate 2.4e-4 \
    --gaudi_config_name Habana/gpt2 \
    --bf16 \
    --save_strategy no \
    --no_save_last_ckpt \
    --output_dir /root/deepseek-v2-lite \
    --overwrite_output_dir \
    --logging_strategy steps \
    --logging_dir /root/deepseek-v2-lite/log \
    --logging_steps 1 \
    --evaluation_strategy no \
    --use_habana \
    --use_lazy_mode \
    --deepspeed llama2_ds_zero3_config.json
```
Comment on lines +187 to +216

Collaborator:

Is there any specific arg here that is specific to this model? Otherwise let's just try to keep the README concise.

@gyou2021 (Contributor, Author), Jan 28, 2025:

Certain arguments, such as `--use_flash_attention`, enable optimizations during training. Some of these argument values were derived from our experiments. The command has been verified on both a local server and DevCloud. If any arguments are missing, such as `--deepspeed llama2_ds_zero3_config.json`, training may fail. To ensure successful execution, we have included these details in the README.
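The reply above leans on `--deepspeed llama2_ds_zero3_config.json`, whose contents are not shown in this diff. For readers who want a sense of what such a ZeRO-3 file contains, here is a minimal sketch written in Python; the keys are standard DeepSpeed options, the `"auto"` values are filled in by the Hugging Face Trainer, and the actual `llama2_ds_zero3_config.json` shipped with the examples may differ:

```python
import json

# Illustrative ZeRO-3 configuration only; not the file referenced by the command above.
zero3_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # partition optimizer states, gradients and parameters across workers
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    # "auto" lets the Trainer propagate its own batch-size and accumulation settings.
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
    "gradient_clipping": 1.0,
    "steps_per_print": 10,
}

with open("zero3_config_example.json", "w") as f:
    json.dump(zero3_config, f, indent=2)
```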


## Multi-Node Training with Deepspeed (GPT-NeoX)

examples/language-modeling/run_clm.py · 34 changes: 34 additions & 0 deletions
@@ -156,6 +156,32 @@ class ModelArguments:
            )
        },
    )
    attn_softmax_bf16: bool = field(
        default=False,
        metadata={"help": ("Whether to run attention softmax layer in bf16 precision for fine-tuning.")},
    )
    use_flash_attention: bool = field(
        default=False,
        metadata={"help": ("Whether to use Habana flash attention for fine-tuning.")},
    )
    flash_attention_recompute: bool = field(
        default=False,
        metadata={
            "help": (
                "Whether to enable recompute in Habana flash attention for fine-tuning."
                " It is applicable only when use_flash_attention is True."
            )
        },
    )
    flash_attention_causal_mask: bool = field(
        default=False,
        metadata={
            "help": (
                "Whether to enable causal mask in Habana flash attention for fine-tuning."
                " It is applicable only when use_flash_attention is True."
            )
        },
    )
    low_cpu_mem_usage: bool = field(
        default=False,
        metadata={
@@ -482,6 +508,14 @@ def main():
    if len(tokenizer) > embedding_size:
        model.resize_token_embeddings(len(tokenizer))

    # We need to add these fused kernel configs
    if model_args.attn_softmax_bf16:
        model.generation_config.attn_softmax_bf16 = True
    if model_args.use_flash_attention:
        model.generation_config.use_flash_attention = True
        model.generation_config.flash_attention_recompute = model_args.flash_attention_recompute
        model.generation_config.flash_attention_causal_mask = model_args.flash_attention_causal_mask

    # Preprocessing the datasets.
    # First we tokenize all the texts.
    if training_args.do_train:
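For context on how the fields added above surface as command-line flags, here is a minimal sketch assuming the standard `HfArgumentParser` flow that `run_clm.py` already uses; the standalone `FlashAttentionArguments` dataclass and the example values are illustrative, not part of the PR:

```python
from dataclasses import dataclass, field

from transformers import HfArgumentParser


@dataclass
class FlashAttentionArguments:
    # Mirrors the fields added to ModelArguments in the diff above.
    attn_softmax_bf16: bool = field(default=False)
    use_flash_attention: bool = field(default=False)
    flash_attention_recompute: bool = field(default=False)
    flash_attention_causal_mask: bool = field(default=False)


# HfArgumentParser turns every dataclass field into a CLI flag, which is why
# `--use_flash_attention True` works in the README command above.
(args,) = HfArgumentParser(FlashAttentionArguments).parse_args_into_dataclasses(
    ["--use_flash_attention", "True", "--flash_attention_recompute", "True"]
)
print(args.use_flash_attention, args.flash_attention_recompute)  # True True
```

Inside `run_clm.py`, the parsed values are then copied onto `model.generation_config`, as the second hunk above shows, so the Gaudi flash-attention kernels are picked up during fine-tuning.
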
optimum/habana/transformers/generation/utils.py · 4 changes: 3 additions & 1 deletion · file mode 100644 → 100755
@@ -1092,8 +1092,9 @@ def generate(
                "gemma2",
                "baichuan",
                "chatglm",
                "deepseek_v2",
            ], (
                "reuse_cache only supported by llama, mistral, falcon, mixtral, phi, qwen2, qwen2_moe, gemma, gemma2, starcoder2, baichuan and chatglm at the moment"
                "reuse_cache only supported by llama, mistral, falcon, mixtral, phi, qwen2, qwen2_moe, gemma, gemma2, starcoder2, baichuan, chatglm and deepseek_v2 at the moment"
            )
            if not generation_config.bucket_internal:
                assert generation_config.bucket_size <= 0, (
@@ -1300,6 +1301,7 @@ def generate(
                "gemma2",
                "qwen2_moe",
                "baichuan",
                "deepseek_v2",
            ]:
                if (
                    hasattr(self.config, "max_position_embeddings")
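The practical effect of adding `"deepseek_v2"` to the allow-list in the hunks above can be summarized with a small standalone check that mirrors the logic of the assertion in `generate()`; the helper below is illustrative, not code from the PR:

```python
# Model types for which generate() accepts reuse_cache, as listed in the
# assertion above (now including deepseek_v2).
REUSE_CACHE_MODEL_TYPES = {
    "llama", "mistral", "falcon", "mixtral", "phi", "qwen2", "qwen2_moe",
    "gemma", "gemma2", "starcoder2", "baichuan", "chatglm", "deepseek_v2",
}


def check_reuse_cache(model_type: str, reuse_cache: bool) -> None:
    # Mirrors the guard above: reuse_cache is rejected for any model type
    # outside the validated list.
    if reuse_cache and model_type not in REUSE_CACHE_MODEL_TYPES:
        raise AssertionError(f"reuse_cache is not supported for model_type={model_type!r}")


check_reuse_cache("deepseek_v2", reuse_cache=True)  # passes after this PR
check_reuse_cache("gpt2", reuse_cache=False)        # also fine: the flag is not requested
```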