DPO training using multi GPU #958
device_map="auto" only allows script runned on one GPU, can you replace it with {"": Accelerator().local_process_index}, and try again? |
Does the error occur in the dpo_loss method in dpo_trainer.py? For example:
Please let me know if you find a better way ^^;
cc @kashif
I found a good way, so I'm sharing it. Thank you ^^
Hello, I load the model like this and start training, as described in the test. Here is the code:
This is my requirements file:
Hyperparameters:
I am running on a g5.12xlarge machine, which has 4 A10G GPUs. Finally, the error:
@sd3ntato can you also paste the command you use to run the training?
Hi, SageMaker runs:
I tried to load on multiple GPUs as well in issue #1117. I didn't get the tensor mismatch issue but got OOM. Using PEFT, I also got the same error; I mentioned it in that issue.
Same issue here! My error is the same. I tried
and my DPOTrainer is:
But now I have another issue:
I found another issue #1117 that uses DeepSpeed ZeRO-2 without a PEFT config to solve this problem. I am wondering how we can use a PEFT config while training on multiple GPUs with DPOTrainer?
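As a rough sketch (not code from this thread) of combining a PEFT config with DPOTrainer: when a peft_config is passed, ref_model can typically be left as None and the adapter-disabled base model serves as the reference. The LoRA hyperparameters below are illustrative, and exact argument names vary across trl versions.

```python
# Hedged sketch of passing a LoRA/PEFT config to DPOTrainer. The LoRA
# hyperparameters and argument names are assumptions for illustration only.
from peft import LoraConfig
from trl import DPOTrainer

peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

trainer = DPOTrainer(
    model,                  # policy model, already loaded in this process
    ref_model=None,         # with a PEFT config, no separate reference model is needed
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```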
Hi @Jerrrrykun |
Here is the bash file we use to run the DPO script: https://github.com/huggingface/trl/blob/main/commands/run_dpo.sh
@younesbelkada I'm trying to apply DPO RLHF to the HF "mistralai/Mixtral-8x7B-Instruct-v0.1" model on an AWS g5 multi-GPU instance
@Amin-Tajgardoon thanks! (See line 146 at commit 8149303.)
The underlying concept is similar to the one described in https://huggingface.co/blog/trl-peft, with a 4-bit base model instead of 8-bit.
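A hedged sketch of that 4-bit (QLoRA-style) loading; the quantization settings below are common defaults rather than values confirmed in this thread:

```python
# Sketch of loading a 4-bit base model for QLoRA-style DPO training.
# Quantization settings are common defaults, not prescribed by this thread.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",  # model named in the comment above
    quantization_config=bnb_config,
    # device_map is intentionally omitted here: use a per-process GPU (as
    # suggested earlier) if the quantized model fits, otherwise it must be sharded.
)
```

The LoRA adapters themselves would then be attached via the peft_config passed to DPOTrainer, as in the earlier sketch.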
@younesbelkada even with QLoRA and |
true, then for distributing the model across different GPUs you need to load the model with |
@Amin-Tajgardoon I am experiencing the same issue as you. Did you manage to find a fix?
I guess the key point is the execution step. To work around this, I used `deepspeed --num_gpus 1 dpo.py` and stored the precomputed dataset on disk by modifying get_train_dataloader in DPOTrainer. This approach proved effective, as both generation and training proceeded smoothly without requiring an additional 7B model loaded into VRAM. I think detailed documentation on how DPO integrates with precompute_ref_log_probs would be incredibly beneficial. It would be even more helpful if examples using 'accelerate' and 'deepspeed' were made available.
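A hypothetical sketch of that workflow, assuming a trl version where DPOConfig exposes precompute_ref_log_probs; the save step and file path are assumptions, not the exact modification described above:

```python
# Hypothetical sketch: precompute reference log-probs once (e.g. in a
# single-GPU run) and persist them, so the multi-GPU training run does not
# need a second reference model in VRAM. Argument names vary across trl versions.
from trl import DPOConfig, DPOTrainer

args = DPOConfig(
    output_dir="dpo-precompute",
    precompute_ref_log_probs=True,   # compute reference log-probs up front
    per_device_train_batch_size=2,
)

trainer = DPOTrainer(
    model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

# get_train_dataloader() triggers the precomputation pass in DPOTrainer; the
# underlying dataset (now carrying reference log-prob columns) is then saved.
precomputed = trainer.get_train_dataloader().dataset
precomputed.save_to_disk("precomputed_dpo_dataset")
```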
Thanks! Hmm I see, indeed there might be some issues between DPO, DeepSpeed and |
Hi team. I think the root cause of the issue is inside of the |
Hi, I use the config file and shell script, but still got an error, which is similar to the case shown here. I successfully launched my script with the settings below. Here is my launch script:
export NCCL_P2P_DISABLE=1
export CUDA_VISIBLE_DEVICES=4,5,6,7
accelerate launch --num_processes=4 --config_file multi_gpu.yaml run_dpo.py \
--dataset_name $hf_dataset_home/trl-lib/ultrafeedback_binarized \
--model_name_or_path $hf_home/meta-llama/Meta-Llama-3-8B-Instruct \
--learning_rate 5.0e-6 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 2 \
--gradient_checkpointing \
--logging_steps 25 \
--eval_strategy steps \
--eval_steps 50 \
--output_dir $output_dir \
--no_remove_unused_columns \
--use_peft \
--dataset_num_proc 16 \
--lora_r 32 \
--lora_alpha 16 > $log_dir/$cur_time.log 2>&1 &
Please let me know if you have any ideas. Thanks a lot. 🙏
Here is my model-loading code:
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    script_args.model_name_or_path,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
    device_map="auto",
    trust_remote_code=True,
)
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
I tried removing quantization, however that causes OOM...
I'm training in an environment with 4x RTX 3090 GPUs.
I need to split the model across GPUs (for GPU memory), but if I pass the dataset to DPOTrainer in Dataset format now, I get this error. What should I do?
I adapted the DPO training example code and am using it.