PPO example not working with DeepSpeed Stage 3 or FSDP #1051

Closed
mgerstgrasser opened this issue Dec 2, 2023 · 15 comments

Comments

@mgerstgrasser
Contributor

I've been trying to get a PPO trainer to work with fully sharded training using either DeepSpeed stage 3 or FSDP. However, no matter what exact configuration options I try, I cannot get even the example in the documentation to work. The problem seems to be in the call to trainer.generate() when sampling a rollout. With FSDP, it usually crashes, with the error message depending on the exact accelerate config (e.g. pytorch/pytorch#82461). With DeepSpeed, the script seems to just hang and time out, without an error message.

Is this known behavior, and is there a working example or documentation of PPO + Deepspeed/FSDP anywhere?

To reproduce, inside examples:
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/ppo.py
or even accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml helloworld.py

@kangqiyue

I've been trying to get a PPO trainer to work with fully sharded training using either DeepSpeed stage 3 or FSDP. However, no matter what exact configuration options I try, I cannot get even the example in the documentation to work. The problem seems to be in the call to trainer.generate() when sampling a rollout. With FSDP, it usually crashes, with the error message depending on the exact accelerate config (e.g. pytorch/pytorch#82461). With DeepSpeed, the script seems to just hang and time out, without an error message.

Is this known behavior, and is there a working example or documentation of PPO + Deepspeed/FSDP anywhere?

To reproduce, inside examples:
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/ppo.py
or even accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml helloworld.py

I ran into the same problem. Any idea how to solve it?

@mgerstgrasser
Contributor Author

mgerstgrasser commented Dec 2, 2023

OK, so I've been doing more digging, and here's what I've found so far. The following is all with the scripts/ppo.py example. @kangqiyue

  • DeepSpeed: Example does work, just very slow.
    It turns out, the example does work with Deepspeed - it's just that it's so slow, I thought it was stuck. But if I turn down the batch and minibatch size to 4 or 16, it does run okay - still pretty slow though. A few things I've noticed:
    • Deepspeed is much slower than running the script without accelerate, even with a single GPU (roughly 5x as slow as running the script without accelerate launch). With 8x GPU, it takes another 4x hit, roughly. Running it in a debugger adds another 2x slowdown. Not sure if all of this is expected.
    • Deepspeed is especially slow on the second iteration, and the time seems to be spent in the trainer.generate() call. This happens only in the second iteration, and it's an order of magnitude slower - think one hour (!) with the original batch size of 128, with 8xA100 GPUs. Hence my previous thinking that it was just frozen completely. The first iteration is fine, and any subsequent iterations seem fine.
    • I am getting Deepspeed warnings along the lines of Invalidate trace cache @ step 0: expected module 0, but got module 1 - possibly this is related to the slowdown?
    • If I try to modify the script to run it with a different model (llama-2-7b), Deepspeed crashes with an error, although now the problem seems to be in the backward step. The error is:
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 331, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 16777216, 'shape': (0,), 'ds_shape': (4096, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {66}, 'ds_tensor.shape': torch.Size([2097152])}

Possibly relevant discussion here - what's interesting is that for them it also seems to happen when training after a forward call.

  • With FSDP, I get the error
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.

That's also discussed e.g. here or here. Some comments in that discussion mention that running a forward pass through the model first solved the problem for them, but I haven't been able to make that work (yet).

That's about as far as I got so far.
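Since a very slow generate() call looked like a hang here, it can help to time each phase of the loop separately. The following is a minimal sketch using only the standard library; PhaseTimer is a hypothetical helper (not part of TRL), and the sleep calls stand in for the real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase of a training loop."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the phase raised (e.g. an NCCL timeout).
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def shares(self):
        """Fraction of total measured time spent in each phase."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


# Stand-in workloads; in ppo.py these would wrap ppo_trainer.generate(...)
# and ppo_trainer.step(...). On GPU, call torch.cuda.synchronize() before
# leaving each phase so queued kernels are counted in the right phase.
timer = PhaseTimer()
for _ in range(3):
    with timer.phase("generate"):
        time.sleep(0.02)
    with timer.phase("step"):
        time.sleep(0.005)

for name, share in sorted(timer.shares().items()):
    print(f"{name}: {share:.0%} of measured time")
```

Printing the totals per iteration rather than only at the end would also show whether one specific iteration is the slow one.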

@kangqiyue

kangqiyue commented Dec 3, 2023

OK, so I've been doing more digging, and here's what I've found so far. The following is all with the scripts/ppo.py example. @kangqiyue

  • DeepSpeed: Example does work, just very slow.
    It turns out, the example does work with Deepspeed - it's just that it's so slow, I thought it was stuck. But if I turn down the batch and minibatch size to 4 or 16, it does run okay - still pretty slow though. A few things I've noticed:

    • Deepspeed is much slower than running the script without accelerate, even with a single GPU (roughly 5x as slow as running the script without accelerate launch). With 8x GPU, it takes another 4x hit, roughly. Running it in a debugger adds another 2x slowdown. Not sure if all of this is expected.
    • Deepspeed is especially slow on the second iteration, and the time seems to be spent in the trainer.generate() call. This happens only in the second iteration, and it's an order of magnitude slower - think one hour (!) with the original batch size of 128, with 8xA100 GPUs. Hence my previous thinking that it was just frozen completely. The first iteration is fine, and any subsequent iterations seem fine.
    • I am getting Deepspeed warnings along the lines of Invalidate trace cache @ step 0: expected module 0, but got module 1 - possibly this is related to the slowdown?
    • If I try to modify the script to run it with a different model (llama-2-7b), Deepspeed crashes with an error, although now the problem seems to be in the backward step. The error is:
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 331, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 16777216, 'shape': (0,), 'ds_shape': (4096, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {66}, 'ds_tensor.shape': torch.Size([2097152])}

Possibly relevant discussion here - what's interesting is that for them it also seems to happen when training after a forward call.

  • With FSDP, I get the error
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.

That's also discussed e.g. here or here. Some comments in that discussion mention that running a forward pass through the model first solved the problem for them, but I haven't been able to make that work (yet).

That's about as far as I got so far.

@mgerstgrasser

  • I ran the script with accelerate and tried Deepspeed stage 2 and stage 3. With stage 3 I hit the same problem: iteration 0 took far too long, and the run ended with the error below:

RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1092, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805633 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.

  • So I tried Deepspeed stage 2 and ran out of memory (OOM), which forced me to switch from full fine-tuning to LoRA fine-tuning. With that setting I can run the script on 1x A100, but with 8x A100 I get a device error:

return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
Traceback (most recent call last):

  • Finally, I switched to a much lighter reward model and ran the script with Deepspeed stage 2. The iterations run fine, except that generation is very slow: when generating 256 tokens, an iteration takes 280.36 s/it. Still, with stage 2 the run does not get stuck at the first iteration. Try Deepspeed stage 2 for fine-tuning! (Note: you need to freeze the shared layers.)

  • Invalidate trace cache @ step 0: expected module 0, but got module 1. I also got this warning and don't know what to do about it. I think we should dig into it further.

  • I have not used FSDP, so I have no idea how to run it.

@kangqiyue

kangqiyue commented Dec 3, 2023

@mgerstgrasser @lvwerra

  • In addition, once training starts, the rewards decrease, as shown here:

[Screenshot: Snipaste_2023-12-03_18-31-29]

  • I tested my model on MT-bench, and it seems to be getting worse compared to my base model. Is there any way to make the mean reward increase instead of decrease? I have increased the batch size to 128.

  • I also think training is too slow, especially generation. I want to generate 512 or 1024 tokens; is there any way to speed this up?

PS:

  • Interestingly, although the mean reward decreased, the 200-step model's average MT-bench score increased from 6.68 to 6.89. I will continue training and keep an eye on its performance.

@mgerstgrasser
Contributor Author

mgerstgrasser commented Dec 3, 2023

@kangqiyue

* I ran the script with accelerate and tried Deepspeed stage 2 and stage 3. With stage 3 I hit the same problem: iteration 0 took far too long, and the run ended with the error below:

RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1092, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805633 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.

Hm, that could be that it's the same issue I encountered, just for you it takes so long it triggers a timeout. Maybe there's a way to increase timeouts? Or try stage 3 with a smaller batch size, e.g. 4?
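On the timeout question: the Timeout(ms)=1800000 in the watchdog message is 30 minutes, and accelerate lets you request a longer one through InitProcessGroupKwargs. The sketch below parses the quoted log line to confirm the numbers and builds such a handler. parse_watchdog and longer_timeout_handler are hypothetical helper names, and the two-hour default is an arbitrary illustration.

```python
import re
from datetime import timedelta

# Pattern for the watchdog line quoted above.
WATCHDOG = re.compile(
    r"OpType=(?P<op>\w+), Timeout\(ms\)=(?P<timeout>\d+)\) "
    r"ran for (?P<elapsed>\d+) milliseconds"
)


def parse_watchdog(line):
    """Return (op, timeout, elapsed) from an NCCL watchdog message, or None."""
    match = WATCHDOG.search(line)
    if match is None:
        return None
    return (
        match["op"],
        timedelta(milliseconds=int(match["timeout"])),
        timedelta(milliseconds=int(match["elapsed"])),
    )


def longer_timeout_handler(hours=2):
    """Build an accelerate kwargs handler with a longer collective timeout.

    Imported lazily so the parsing helper above works without accelerate.
    """
    from accelerate.utils import InitProcessGroupKwargs

    return InitProcessGroupKwargs(timeout=timedelta(hours=hours))


line = (
    "WorkNCCL(SeqNum=1092, OpType=ALLREDUCE, Timeout(ms)=1800000) "
    "ran for 1805633 milliseconds before timing out."
)
op, timeout, elapsed = parse_watchdog(line)
print(op, timeout, elapsed)
```

The handler would be passed as Accelerator(kwargs_handlers=[longer_timeout_handler()]). How to get it into PPOTrainer, which builds its own Accelerator, and whether the timeout is honored under DeepSpeed may depend on the TRL and accelerate versions, so treat that part as an assumption to verify. A longer timeout only hides the slowdown; it does not fix it.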

* So I tried Deepspeed stage 2 and ran out of memory (OOM), which forced me to switch from full fine-tuning to LoRA fine-tuning. With that setting I can run the script on 1x A100, but with 8x A100 I get a device error:

return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
Traceback (most recent call last):

Interesting, I haven't run into that, but it's curious that the error happens at the exact same line as the FSDP error! Do you maybe need a .to(ppo_trainer.accelerator.device) somewhere, e.g. on the inputs?

* Finally, I switched to a much lighter reward model and ran the script with Deepspeed stage 2. The iterations run fine, except that generation is very slow: when generating 256 tokens, an iteration takes 280.36 s/it. Still, with stage 2 the run does not get stuck at the first iteration. Try Deepspeed stage 2 for fine-tuning! (Note: you need to freeze the shared layers.)

Shared layers between what? Actor and critic? Or are you somehow sharing layers between the actor-critic model and the reward model?

* `Invalidate trace cache @ step 0: expected module 0, but got module 1.` I also got this warning and don't know what to do about it. I think we should dig into it further.

* I have not used FSDP, so I have no idea how to run it.

For FSDP, I actually have it sort-of working now. Indeed calling model.forward() before trainer.generate() for some reason can fix that memory allocation error. (At least on my modified llama-2 version of the example script, for some reason I couldn't get it to work in the original example script.) There are other remaining issues though, e.g. the standard trl AutoModelForCausalLMWithValueHead forces the value head layer to be fp32 even if the rest of the model is bf16, which doesn't work with FSDP out of the box. If I make that bf16 instead, I can run the script with FSDP. It's still slow (albeit not as slow as with Deepspeed), and memory is also problematic.
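To spot the fp32 value head next to a bf16 backbone before wrapping the model, a small dtype audit is enough. This is a sketch under assumptions: params_by_dtype is a hypothetical helper that relies only on named_parameters(), and ToyModel with its parameter names is a stand-in so the example runs without torch. With a real model you would pass the model object directly.

```python
from collections import defaultdict
from types import SimpleNamespace


def params_by_dtype(model):
    """Group parameter names by dtype using model.named_parameters()."""
    groups = defaultdict(list)
    for name, param in model.named_parameters():
        groups[str(param.dtype)].append(name)
    return dict(groups)


class ToyModel:
    """Hypothetical stand-in exposing named_parameters() like nn.Module."""

    def __init__(self, dtypes):
        self._dtypes = dtypes

    def named_parameters(self):
        for name, dtype in self._dtypes.items():
            yield name, SimpleNamespace(dtype=dtype)


toy = ToyModel(
    {
        "pretrained_model.layers.0.weight": "torch.bfloat16",
        "pretrained_model.layers.1.weight": "torch.bfloat16",
        "v_head.summary.weight": "torch.float32",
    }
)
groups = params_by_dtype(toy)
for dtype, names in groups.items():
    print(dtype, len(names), names[:3])
if len(groups) > 1:
    print("mixed dtypes: FSDP may refuse to flatten these together")
```

If more than one dtype shows up, casting the odd parameters to match (as described above for the value head) is one option; whether that is numerically acceptable for the value head is a separate question.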

@mgerstgrasser @lvwerra

* In addition, once training starts, the rewards decrease, as shown here:

[Screenshot: Snipaste_2023-12-03_18-31-29]

* I tested my model on MT-bench, and it seems to be getting worse compared to my base model. Is there any way to make the mean reward increase instead of decrease? I have increased the batch size to 128.

I have noticed bad training metrics as well, both with the original example script as well as with my own llama-2 version, e.g. KL divergence goes negative in both very quickly. I haven't looked into it further yet, but I would guess that the PPO hyperparameters in the training script might be completely off. Again, haven't looked into it though.
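One general way a reported KL can go negative, offered as a possibility rather than a diagnosis of this script: if the KL is estimated as the average of log policy(x) - log ref(x) over generated tokens, it is only guaranteed non-negative in expectation when the tokens are sampled from the policy itself. The toy below uses made-up three-token distributions to show the sign flip when sampling is distorted.

```python
import math


def expected_log_ratio(sampling, policy, ref):
    """E_{x ~ sampling}[log policy(x) - log ref(x)] for categorical dists."""
    return sum(
        s * (math.log(p) - math.log(r))
        for s, p, r in zip(sampling, policy, ref)
        if s > 0
    )


policy = [0.7, 0.2, 0.1]  # made-up next-token probabilities
ref = [0.5, 0.3, 0.2]

# Sampling from the policy itself: this is the true KL and is never negative.
on_policy = expected_log_ratio(policy, policy, ref)

# Sampling from a distorted distribution (e.g. after top-k filtering or a
# forced minimum length) that favours tokens the policy finds unlikely.
distorted = [0.1, 0.2, 0.7]
off_policy = expected_log_ratio(distorted, policy, ref)

print(f"on-policy estimate:  {on_policy:+.4f}")
print(f"off-policy estimate: {off_policy:+.4f}")
```

So the generation settings are worth checking alongside the PPO hyperparameters when the KL turns negative early.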

* I also think training is too slow, especially generation. I want to generate 512 or 1024 tokens; is there any way to speed this up?

Yes, agree. In my own testing, it seems 99% of the time each iteration is spent generating, that's clearly not good. I imagine there must be some bug with sharded generation, but I don't know much more than that right now.

@kangqiyue

kangqiyue commented Dec 4, 2023

@mgerstgrasser

  • I ran the script with accelerate and tried Deepspeed stage 2 and stage 3. With stage 3 I hit the same problem: iteration 0 took far too long, and the run ended with the error below:
    RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1092, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805633 milliseconds before timing out.
    [E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
    [E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.

Hm, that could be that it's the same issue I encountered, just for you it takes so long it triggers a timeout. Maybe there's a way to increase timeouts? Or try stage 3 with a smaller batch size, e.g. 4?

I think it is the same problem we both ran into. It is very weird that the first iteration takes so long. I think what you observed could be the same thing I saw. As you pointed out:

DeepSpeed: Example does work, just very slow.
It turns out, the example does work with Deepspeed - it's just that it's so slow, I thought it was stuck. But if I turn down the batch and minibatch size to 4 or 16, it does run okay - still pretty slow though.

Interesting, I haven't run into that, but it's curious that the error happens at the exact same line as the FSDP error! Do you maybe need a .to(ppo_trainer.accelerator.device) somewhere, e.g. on the inputs?

For LoRA tuning, I just used the default code in ppo.py, as follows:

query_tensors = batch["input_ids"]

# Get response from gpt2
response_tensors, ref_response_tensors = ppo_trainer.generate(
    query_tensors, return_prompt=False, generate_ref_response=True, **generation_kwargs
)

I assumed ppo_trainer.generate would handle the input tensors automatically, so I did not move them manually. As you suggest, I could try .to(ppo_trainer.accelerator.device).

Shared layers between what? Actor and critic? Or are you somehow sharing layers between the actor-critic model and the reward model?

  • My original reward model is a 7B model, so I switched to a very small one to avoid OOM errors. For training, the trained model (the actor) and what you call the critic (the reference model, if I understand correctly; I usually say model and ref_model) share layers, set up with the create_reference_model function. Try it with Deepspeed stage 2; if you hit OOM, freeze all layers except the last one. Then I think you can run your script.

Yes, agree. In my own testing, it seems 99% of the time each iteration is spent generating, that's clearly not good. I imagine there must be some bug with sharded generation, but I don't know much more than that right now.

In addition, I have a question about the trained model and ref_model. I have not figured out whether the parameters of ref_model are updated during training. As far as I understand, the trained model's parameters are updated, but I do not know whether ref_model's are. What is your view? If ref_model is frozen, I think we could speed up its generation.
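Whether a model is frozen can be checked directly by counting parameters with requires_grad set. As far as I know, create_reference_model returns a copy with requires_grad=False on every parameter, but that is worth confirming on your TRL version. The sketch below is duck-typed so it runs without torch; trainable_summary, fake_param, and ToyModel are hypothetical names, and the parameter sizes are made up.

```python
from types import SimpleNamespace


def trainable_summary(model):
    """Return (trainable, total) parameter counts for a model-like object."""
    trainable = total = 0
    for param in model.parameters():
        n = param.numel()
        total += n
        if param.requires_grad:
            trainable += n
    return trainable, total


def fake_param(n, requires_grad):
    """Hypothetical stand-in for a torch.nn.Parameter."""
    return SimpleNamespace(numel=lambda: n, requires_grad=requires_grad)


class ToyModel:
    def __init__(self, params):
        self._params = params

    def parameters(self):
        return iter(self._params)


# Actor with its first layer frozen (shared), reference model fully frozen.
model = ToyModel([fake_param(100, False), fake_param(50, True)])
ref_model = ToyModel([fake_param(100, False), fake_param(50, False)])

print("model:", trainable_summary(model))
print("ref_model:", trainable_summary(ref_model))
```

With real objects, trainable_summary(ppo_trainer.ref_model) returning zero trainable parameters would indicate the reference model is not being updated.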

@vwxyzjn
Contributor

vwxyzjn commented Dec 7, 2023

We have some preliminary tests working with 7B models using PPO.

Can you try running accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml examples/scripts/ppo.py --ppo_config.exp_name ppo_Cerebras-GPT-6.7B_grad_accu_deepspeed_stage2 --ppo_config.batch_size 32 --ppo_config.mini_batch_size 32 --ppo_config.log_with wandb --ppo_config.model_name cerebras/Cerebras-GPT-6.7B --ppo_config.reward_model sentiment-analysis:cerebras/Cerebras-GPT-6.7B to see at least whether you can run this script with 8 GPUs?

image

@viethoangtranduong
Contributor

@vwxyzjn I've been using your script for a specific use case, and it proceeded to the training phase —thank you for providing it.

However, I cannot find a way to safely save the model at checkpoints, after some epochs, or at the end of training using ppo.py.

The file does not provide a direct way to save models after a given number of epochs, so I tried to save the model manually.

When I use either ppo_trainer._save_pretrained(path_name) or ppo_trainer.save_pretrained(path_name) to save the model, it seems to save only a subset of layers. For instance, the output indicates that shared tensors like 'model.layers.9.mlp.up_proj.weight', 'model.layers.10.post_attention_layernorm.weight', and several others are removed during the save process. The message displayed is:

Removed shared tensor {'model.layers.9.mlp.up_proj.weight' .... q_proj.weight', 'model.layers.10.post_attention_layernorm.weight', 'model.layers.24.mlp.down_proj.weight', 'model.layers.1.self_attn.k_proj.weight', 'model.layers.4.self_attn.q_proj.weight', 'model.layers.13.mlp.up_proj.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading

Subsequently, when I attempt to reload that saved model using AutoModelForCausalLM.from_pretrained, it results in a warning message:

Some weights of MistralForCausalLM were not initialized from the model checkpoint at my_save_path and are newly initialized [those layers above]

This suggests that the model is not being saved correctly. Could you provide any guidance on how to ensure the entire model is saved and can be properly reloaded? My code for saving the model to a local directory is as straightforward as the commands mentioned above.

The accelerate command I ran is exactly the one above, and I modified examples/accelerate_configs/deepspeed_zero2.yaml to be compatible with 4 GPUs instead of the default 8. I'm running on 4xA100 80GB. I modified ppo.py to save ppo_trainer after X epochs. I tried both saving on every GPU and saving only when device==0, and neither worked.

Thank you for your guidance.

transformers.__version__, accelerate.__version__, deepspeed.__version__
>> ('4.35.1', '0.24.1', '0.12.5')
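For reference, the pattern that accelerate documents for saving under DeepSpeed gathers the state dict through the accelerator and lets only the main process write. The sketch below wraps it in a hypothetical helper, save_full_model. It is not confirmed anywhere in this thread to resolve the shared-tensor warning, so treat it as something to try.

```python
def save_full_model(accelerator, model, save_dir):
    """Gather the full state dict and write it from the main process only.

    `accelerator` is an accelerate.Accelerator (for PPOTrainer, assumed to be
    ppo_trainer.accelerator) and `model` is the prepared model.
    """
    # Every rank must reach this point: gathering weights is a collective.
    accelerator.wait_for_everyone()
    state_dict = accelerator.get_state_dict(model)
    unwrapped = accelerator.unwrap_model(model)
    unwrapped.save_pretrained(
        save_dir,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
        state_dict=state_dict,
    )
    return state_dict
```

A call might look like save_full_model(ppo_trainer.accelerator, ppo_trainer.model, "my_save_path"), executed on every rank rather than inside an if device==0 branch. Comparing the keys of the returned state dict with the layer names listed in the reload warning would show whether those weights were ever written.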

@mgerstgrasser
Contributor Author

mgerstgrasser commented Jan 4, 2024

We have some preliminary tests working with 7B models using PPO.

Can you try running accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml examples/scripts/ppo.py --ppo_config.exp_name ppo_Cerebras-GPT-6.7B_grad_accu_deepspeed_stage2 --ppo_config.batch_size 32 --ppo_config.mini_batch_size 32 --ppo_config.log_with wandb --ppo_config.model_name cerebras/Cerebras-GPT-6.7B --ppo_config.reward_model sentiment-analysis:cerebras/Cerebras-GPT-6.7B to see at least whether you can run this script with 8 GPUs?

Hi @vwxyzjn - thanks so much for looking into this, and sorry it's taken me a while to reply. I've finally had time to try this out, and unfortunately, this still fails for me! In fact, the script fails even with the default model when using deepspeed for me (i.e. running the command you suggested, but without setting either ppo_config.model_name or ppo_config.reward_model).
The error seems to be a timeout somewhere during the generate() call; the exact error message varies between runs, but it seems to always be in that method. It also seems that iteration 0 runs fine, but then iteration 1 throws the error. This seems to match some of my earlier experiments where I could get it to run with much smaller batch sizes, but generate() in iteration 1 took a very, very long time - possibly that's still the issue here, but with a bigger batch size it leads to a timeout?

edit: removed logs as they are not relevant, see next comment.

@mgerstgrasser
Contributor Author

Further update: After a lot more digging, it turns out the NCCL timeout in Deepspeed Stage 2 was the same issue as #1103, and is fixed by #1177.

I'm still seeing issues with Deepspeed Stage 3 though, e.g. the Invalidate trace cache @ step 0: expected module 0, but got module 1 warning above, performance seems much lower than I would expect, and sometimes it crashes after a few iterations with the params INFLIGHT error I mentioned earlier.
Any chance you have any advice on that? @vwxyzjn

@vwxyzjn
Contributor

vwxyzjn commented Jan 5, 2024

Just to confirm, does deepspeed stage 3 work for you at the moment? I had weird experiences with stage 3 as well, saw Invalidate trace cache @ step 0: expected module 0, but got module 1.

stage 2 should be sufficient, as https://github.com/OpenLMLab/MOSS-RLHF can fit in a 6.9B critic, reward, policy, and ref policy model.

Also, what are your hardware settings?

@mgerstgrasser
Contributor Author

mgerstgrasser commented Jan 5, 2024

Just to confirm, does deepspeed stage 3 work for you at the moment? I had weird experiences with stage 3 as well, saw Invalidate trace cache @ step 0: expected module 0, but got module 1.

stage 2 should be sufficient, as https://github.com/OpenLMLab/MOSS-RLHF can fit in a 6.9B critic, reward, policy, and ref policy model.

Also, what are your hardware settings?

Ah, ideally I'd like to scale up to 13B and maybe even 70B models though.

Re "is stage 3 working" - yes! Up until last night it always crashed after a few steps with that params INFLIGHT error, but it seems this was also fixed by #1177, at least as far as I can tell in a few preliminary experiments.

The remaining issue is that it's slow - around 6-8 times as long per iteration compared to stage 2. That's just with the default model in the ppo example script, and 2 or 4 NVLinked GPUs. I realise of course that it will be somewhat slower than stage 2, but is that big of a performance drop expected?

Re hardware: Varies, but the main configuration I've been using has 4x A100 80GB SXM4 per node, and 200Gbps Infiniband between nodes. For the 7B model I've tested on a node with 8x A100 80GB SXM4 GPUs.
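One possible contributor to the stage 3 slowdown, stated as an assumption rather than something measured in this thread: ZeRO stage 3 shards the parameters, so each forward pass has to all-gather them, and autoregressive generation runs one forward pass per new token. The back-of-envelope function below is hypothetical and deliberately crude; the model size, GPU count, and token count are illustrative.

```python
def zero3_gather_bytes(n_params, bytes_per_param, n_gpus, n_forward_passes):
    """Rough bytes each GPU receives from parameter all-gathers.

    Assumes every forward pass re-gathers all parameters and that each GPU
    already holds 1/n_gpus of them. Ignores prefetching, persistence
    thresholds, and any caching of gathered parameters.
    """
    missing_fraction = (n_gpus - 1) / n_gpus
    per_pass = n_params * bytes_per_param * missing_fraction
    return per_pass * n_forward_passes


# Illustrative numbers: a 7B-parameter model in bf16 on 8 GPUs,
# generating 256 new tokens (one forward pass per token).
per_token = zero3_gather_bytes(7e9, 2, 8, 1)
per_rollout = zero3_gather_bytes(7e9, 2, 8, 256)
print(f"per token:   {per_token / 1e9:.2f} GB")
print(f"per rollout: {per_rollout / 1e12:.2f} TB")
```

Under these assumptions the traffic scales linearly with the number of generated tokens, which would be consistent with generation dominating the iteration time. It does not by itself explain a 6-8x gap on the small default model.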

@muupan
Contributor

muupan commented Jan 21, 2024

Hello, I am facing the same error with deepspeed zero3, AssertionError assert param.ds_status == ZeroParamStatus.AVAILABLE, raised in self.accelerator.backward(loss) inside PPOTrainer.train_minibatch. I suspect the error is related to the issues reported in microsoft/DeepSpeed#4194 and microsoft/DeepSpeed#4194, where using transformers==4.31.0 is reported to work. But for me downgrading transformers caused other issues, so I would really like to know of other solutions, if any.
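Since the behaviour here seems to depend on the library versions, it helps to include them in every report. A minimal standard-library sketch follows; installed_versions is a hypothetical helper and the package list is only a suggestion.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(["transformers", "accelerate", "deepspeed", "trl", "torch"])
for name, ver in report.items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```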


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@chongxiaoc

I'm seeing the same issue with Deepspeed ZeRO 3.
