
fix: RuntimeError for UCP large DP #6918

Open · wants to merge 4 commits into master
Conversation

saforem2 (Collaborator)

We encountered a strange bug when attempting to convert checkpoints (created with DP=768) to universal format.

An overview of the bug as well as a detailed description of the proposed fix is written up in:

argonne-lcf/Megatron-DeepSpeed/ALCF/notes/universal_checkpoint_bug.md

@loadams requested a review from lekurile on December 30, 2024
@saforem2 (Collaborator, Author)

@loadams thanks for the formatting fix!

Also, just wanted to say no rush on this. I spoke briefly with @minjiazhang before the holidays about this issue and mentioned I would write up a more complete description of what I was seeing.

Maybe some minor thoughts:

I think the change in deepspeed/checkpoint/deepspeed_checkpoint.py,
i.e. passing the strip_tensor_paddings argument through to the self.zero_checkpoint.get_state_for_rank call (shown below):

```diff
-    def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index) -> dict:
+    def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index, strip_tensor_paddings: bool = True):
         return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
                                                        tp_index=tp_index,
                                                        dp_index=dp_index,
-                                                       keys_to_ignore=[PARAM_SHAPES])
+                                                       keys_to_ignore=[PARAM_SHAPES],
+                                                       strip_tensor_paddings=strip_tensor_paddings)
```

✅ is OK since this just passes the argument through.

However, I'm a bit less sure about this change:

```diff
 sd = ds_checkpoint.get_zero_checkpoint_state(pp_index=pp_index,
                                              tp_index=tp_index,
                                              dp_index=dp_index,
+                                             strip_tensor_paddings=False)
```

since I'm not completely clear on how the internals of this
_strip_tensor_paddings() function work.

For our purposes, setting this to False, and thereby skipping the

```python
if strip_tensor_paddings:
    self._strip_tensor_paddings(sd)
```

block in the get_state_for_rank call, seems to work, though I'm not really sure why.

@xylian86 (Contributor)

xylian86 commented Jan 20, 2025

@saforem2 Thank you for reporting the issues!

Let me first explain why we need the _strip_tensor_paddings() function:

In ZeRO, when optimizer states are partitioned across different ranks, padding is added to align the NCCL all-gather send buffers to 4-byte boundaries. With 2-byte (fp16/bf16) elements this is an alignment factor of 4 / 2 = 2 elements per rank, so when DP=768 the group alignment becomes 4 / 2 * 768 = 1536 elements (Code Reference).
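
For concreteness, here is a minimal sketch of that alignment computation (the constants and function name are illustrative, not DeepSpeed's actual code):

```python
# Minimal sketch of the alignment described above; names and constants are
# illustrative, not DeepSpeed's actual API.
NCCL_ALIGNMENT_BYTES = 4  # all-gather send buffers are aligned to 4 bytes
ELEMENT_BYTES = 2         # fp16 / bf16 elements

def group_alignment(dp_size: int) -> int:
    """Element alignment for a flattened param group under ZeRO partitioning."""
    return (NCCL_ALIGNMENT_BYTES // ELEMENT_BYTES) * dp_size

print(group_alignment(768))  # 4 / 2 * 768 = 1536 elements
```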

Universal Checkpointing follows a padding-free design principle. When converting to Universal Checkpoints, we must remove these paddings because:

  • Different GPU cluster configurations will require different padding sizes
  • New padding values should be calculated dynamically during loading
  • Retaining old padding could lead to functional or correctness issues when loading checkpoints

Therefore, while skipping _strip_tensor_paddings() in this PR might avoid the immediate conversion failure, I suspect that keeping the padding could cause problems later when loading the universal checkpoint, either functional issues or correctness issues.


Why does the conversion work for smaller DP sizes but fail when the DP size equals 768?

Let me explain the cause of the issue. From the log you provided (thank you for providing such detailed logs!!!), I can see the model config (number of layers = 32, hidden size = 4096), which is a ~7B model, and that you trained with ZeRO-1 and DP size 768.

Param groups are created based on the weight decay condition (regularized vs. non-regularized) and the learning rate scale condition (args.lr vs. lr_mult * args.lr) (Code Reference).

For one group, the flattened tensor numel is 266240 and the alignment is 1536, so the required padding = 1536 * 174 - 266240 = 1024, and the per-rank partition size is 348 (1536 * 174 / 768). The current implementation assumes that the number of elements assigned to each rank will exceed the padding size; however, in this case we have 348 < 1024, violating that assumption.
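
As a quick sanity check, the arithmetic above can be reproduced directly (a sketch using the numbers from the log; the variable names are mine):

```python
import math

# One flattened param group from the log, trained with DP=768.
numel = 266240                                   # flattened tensor numel
dp_size = 768
align = (4 // 2) * dp_size                       # 1536-element alignment

padded_numel = math.ceil(numel / align) * align  # 1536 * 174 = 267264
padding = padded_numel - numel                   # 1024
partition_size = padded_numel // dp_size         # 348 elements per rank

# The stripping logic assumes each rank's partition is at least as large as
# the group padding; with DP=768 that assumption no longer holds.
print(padding, partition_size)                   # 1024 348
print(partition_size >= padding)                 # False -> the corner case hit here
```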


The correct way to resolve this is to fix the _strip_tensor_paddings() function to handle such corner cases.
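
One possible shape for such a fix (purely a sketch of the idea, with hypothetical helper names, not the actual DeepSpeed change): when the group padding is larger than a single rank's partition, the padding spans several trailing ranks, so the stripping logic has to work out how much of each rank's slice is padding instead of attributing all of it to the last rank.

```python
# Sketch only: how much of one rank's partition is padding when the group
# padding may span multiple trailing ranks. Names are hypothetical.
def rank_padding(group_padding: int, partition_size: int, dp_rank: int, dp_world_size: int) -> int:
    padded_numel = partition_size * dp_world_size
    pad_start = padded_numel - group_padding   # first padded element in the flat group
    rank_start = dp_rank * partition_size      # this rank owns [rank_start, rank_end)
    rank_end = rank_start + partition_size
    return max(0, rank_end - max(pad_start, rank_start))

# With the numbers above (padding=1024, partition=348, DP=768), the padding is
# spread over the last three ranks: 328 + 348 + 348 = 1024.
print([rank_padding(1024, 348, r, 768) for r in (764, 765, 766, 767)])  # [0, 328, 348, 348]
```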

@tjruwase @tohtana Please let me know if you have any suggestions or feedback on it.

@xylian86 (Contributor)

More specifically, the issue arises from the following lines in the ZeRO-1 engine:

```python
            if partition_id == dist.get_world_size(group=self.real_dp_process_group[i]) - 1:
                padding = self.bit16_groups_flat[i].numel() - orig_group_numel
            else:
                padding = 0
```

which assume that only the last rank holds the padding.
