Before update the tr_loss, make sure tr_loss_step is in the same device. #1439

pengwei715 · 2024-03-18T17:00:51Z

What does this PR do?

Fixes: #958
Fixes: #1399

It's a very strong assumption that tr_loss and tr_loss_step are on the same device. args.device may not be the same as the current device.

For example, the dpo_trainer is a child class of trainer. If you run a DPO training job on one node with multiple GPUs and set device_map='auto', tr_loss_step can end up on a different device than tr_loss after the dpo_loss is computed.

Based on the feedback from @guy1992l, @younesbelkada, and @amyeroberts, I have opened a PR against the transformers Trainer class to add an assertion check.

huggingface/transformers#29695

This PR moves tr_loss_step to self.args.device before tr_loss is updated.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@younesbelkada
@amyeroberts
@guy1992l

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

guy1992l

a suggestion

trl/trainer/dpo_trainer.py

Co-authored-by: guy1992l <[email protected]>

younesbelkada

Great ! Thanks so much @pengwei715 @guy1992l !

HuggingFaceDocBuilderDev · 2024-03-19T09:23:56Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

younesbelkada · 2024-03-19T09:25:07Z

just double checking @kashif is this all good in your opinion?

pengwei715 added 2 commits March 18, 2024 09:48

before update the loss from dpo, make sure it's in the same device of…

5628493

… tr_loss

Merge branch 'main' into feature/check_if_in_same_device

f4de299

pengwei715 mentioned this pull request Mar 18, 2024

fixed the issue of DPO trainer that using one node and mutiple GPUs and set the device_map='auto' huggingface/transformers#29695

Merged

5 tasks

guy1992l reviewed Mar 18, 2024

trl/trainer/dpo_trainer.py (resolved)

Update trl/trainer/dpo_trainer.py

7f5df2a

Co-authored-by: guy1992l <[email protected]>

younesbelkada approved these changes Mar 19, 2024

younesbelkada requested a review from kashif March 19, 2024 09:24

kashif approved these changes Mar 19, 2024

younesbelkada merged commit f976c6d into huggingface:main Mar 19, 2024
9 checks passed

younesbelkada mentioned this pull request Apr 8, 2024

bug-fix: avoid 'Expected all tensors to be on the same device' error when doing multi-GPU training huggingface/transformers#29144

Closed

kallewoof mentioned this pull request Apr 8, 2024

DPO QLoRA training on dual gpu fails with "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!" axolotl-ai-cloud/axolotl#1302

Closed

8 tasks

SunMarc mentioned this pull request Aug 12, 2024

Fix orpo trainer loss device #1919

Merged
