Update language-modeling README file (huggingface#1599)
Co-authored-by: Libin Tang <[email protected]>
Co-authored-by: regisss <[email protected]>
3 people authored Dec 12, 2024
1 parent dd54fda commit d3973e0
Showing 1 changed file with 47 additions and 0 deletions.
47 changes: 47 additions & 0 deletions examples/language-modeling/README.md
@@ -926,6 +926,53 @@ python3 ../gaudi_spawn.py --world_size 8 --use_mpi peft_poly_seq2seq_with_genera
--trust_remote_code
```

### Training models with long sequence lengths
We have added support for [DeepSpeed Ulysses](https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-ulysses/README.md), which allows training large transformer models on very long input sequences with limited hardware resources. This feature has been tested with Llama3.1-8B and Llama3.1-70B fine-tuning at an input sequence length of 32k on 8x Gaudi3 cards. A reference command for Llama3.1-8B fine-tuning is given below.

`--context_parallel_size` sets the number of cards a single input sequence is split across. For example, setting `context_parallel_size=4` with `max_seq_len=32k` means each card processes an 8k-token chunk of the input, which reduces the memory needed for activations. This feature can be combined with ZeRO-3 to scale not only to long sequence lengths but also to large models.
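
To make the chunking arithmetic explicit, here is a small shell sketch (illustration only; these are not extra options to pass to the training script):

```bash
# Each card holds max_seq_len / context_parallel_size tokens of activations.
MAX_SEQ_LEN=32768
CONTEXT_PARALLEL_SIZE=4
echo "tokens per card: $((MAX_SEQ_LEN / CONTEXT_PARALLEL_SIZE))"   # prints 8192
```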

> [!NOTE]
> This feature is still in beta version and may not work out of the box for all transformer model architectures and configurations.
```bash
HL_DS_DISTRIBUTED_ATTENTION_SEQ_DIM=1 \
python3 ../gaudi_spawn.py \
--world_size 8 --use_deepspeed run_lora_clm.py \
--model_name_or_path meta-llama/Llama-3.1-8B \
--dataset_name tatsu-lab/alpaca \
--bf16 True \
--output_dir /tmp/lora_out \
--max_seq_len 32768 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--save_strategy no \
--learning_rate 0.0004 \
--warmup_ratio 0.03 \
--lr_scheduler_type "constant" \
--logging_steps 1 \
--dataset_concatenation \
--do_train \
--use_habana \
--throughput_warmup_steps 3 \
--lora_rank 8 \
--lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" \
--attn_softmax_bf16 True \
--validation_split_percentage 4 \
--flash_attention_causal_mask True \
--evaluation_strategy epoch \
--pipelining_fwd_bwd \
--use_lazy_mode \
--use_flash_attention True \
--deepspeed llama3_ds_zero1_config.json \
--num_train_epochs 3 \
--eval_delay 3 \
--do_eval \
--lora_alpha 16 \
--lora_dropout 0.05 \
--gradient_accumulation_steps 4 \
--flash_attention_recompute True \
--context_parallel_size 4
```
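
The command above points to `llama3_ds_zero1_config.json`, which is not part of this diff. As a rough sketch only, assuming a minimal ZeRO stage-1 setup rather than the exact file shipped with the example, such a config could be created like this:

```bash
# Sketch of a minimal ZeRO-1 DeepSpeed config (assumed contents, not the file
# from this repository). The "auto" entries are placeholders that the Hugging
# Face Trainer's DeepSpeed integration fills in from the training arguments.
cat > llama3_ds_zero1_config.json << 'EOF'
{
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000
}
EOF
```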

## Streaming
