
fix: RuntimeError for UCP large DP #6918

Open · wants to merge 4 commits into master
Conversation

saforem2 (Collaborator)

We encountered a strange bug when attempting to convert checkpoints (created with DP=768) to universal format.

An overview of the bug as well as a detailed description of the proposed fix is written up in:

argonne-lcf/Megatron-DeepSpeed/ALCF/notes/universal_checkpoint_bug.md

@loadams requested a review from lekurile on December 30, 2024
@saforem2 (Collaborator, Author)

@loadams thanks for the formatting fix!

Also, just wanted to say no rush on this. I spoke briefly with @minjiazhang before the holidays about this issue and mentioned I would write up a more complete description of what I was seeing.

Maybe some minor thoughts:

I think the change in deepspeed/checkpoint/deepspeed_checkpoint.py,
i.e. passing the strip_tensor_paddings argument through to the self.zero_checkpoint.get_state_for_rank call (shown below):

```diff
-    def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index) -> dict:
+    def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index, strip_tensor_paddings: bool = True):
         return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
                                                        tp_index=tp_index,
                                                        dp_index=dp_index,
-                                                       keys_to_ignore=[PARAM_SHAPES])
+                                                       keys_to_ignore=[PARAM_SHAPES],
+                                                       strip_tensor_paddings=strip_tensor_paddings)
```

✅ is OK since this just passes the argument through.

However, I'm a bit less sure about this change:

```diff
 sd = ds_checkpoint.get_zero_checkpoint_state(pp_index=pp_index,
                                              tp_index=tp_index,
                                              dp_index=dp_index,
+                                             strip_tensor_paddings=False)
```

since I'm not completely clear on how the internals of this
_strip_tensor_paddings() function work.

For our purposes, setting this to False, and thereby skipping the

```python
if strip_tensor_paddings:
    self._strip_tensor_paddings(sd)
```

block in the get_state_for_rank call, seems to work, though I'm not really sure why.

@xylian86 (Contributor)

xylian86 commented Jan 20, 2025

@saforem2 Thank you for reporting the issues!

Let me first explain why we need the _strip_tensor_paddings() function:

In ZeRO, when optimizer states are partitioned across different ranks, padding is added to align the NCCL all-gather send buffers to 4-byte boundaries. With 2-byte (fp16/bf16) elements this is an alignment factor of 4 / 2 = 2 elements per rank, so when DP=768 the group alignment becomes 4 / 2 * 768 = 1536 elements (Code Reference).
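
For concreteness, here is a minimal sketch of that alignment computation (the constants and function name are illustrative, not DeepSpeed's actual code):

```python
# Minimal sketch of the alignment described above; names and constants are
# illustrative, not DeepSpeed's actual API.
NCCL_ALIGNMENT_BYTES = 4  # all-gather send buffers are aligned to 4 bytes
ELEMENT_BYTES = 2         # fp16 / bf16 elements

def group_alignment(dp_size: int) -> int:
    """Element alignment for a flattened param group under ZeRO partitioning."""
    return (NCCL_ALIGNMENT_BYTES // ELEMENT_BYTES) * dp_size

print(group_alignment(768))  # 4 / 2 * 768 = 1536 elements
```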

Universal Checkpointing follows a padding-free design principle. When converting to Universal Checkpoints, we must remove these paddings because:

  • Different GPU cluster configurations will require different padding sizes
  • New padding values should be calculated dynamically during loading
  • Retaining old padding could lead to functional or correctness issues when loading checkpoints

Therefore, while skipping _strip_tensor_paddings() in this PR might avoid the immediate conversion failure, I suspect that keeping the padding could cause problems later when loading the universal checkpoint, either functional issues or correctness issues.


Why does the conversion work for smaller DP sizes but fail when the DP size equals 768?

Let me explain the cause of the issue. From the log you provided (thank you for providing such detailed logs!!!), I can see the model config (number of layers = 32, hidden size = 4096), which is a ~7B model, and that you trained with ZeRO-1 and DP size 768.

Param groups are created based on the weight decay condition (regularized vs. non-regularized) and the learning rate scale condition (args.lr vs. lr_mult * args.lr) (Code Reference).

For one group, the flattened tensor numel is 266240 and the alignment is 1536, so the required padding = 1536 * 174 - 266240 = 1024, and the per-rank partition size is 348 (1536 * 174 / 768). The current implementation assumes that the number of elements assigned to each rank will exceed the padding size; however, in this case we have 348 < 1024, violating that assumption.
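
As a quick sanity check, the arithmetic above can be reproduced directly (a sketch using the numbers from the log; the variable names are mine):

```python
import math

# One flattened param group from the log, trained with DP=768.
numel = 266240                                   # flattened tensor numel
dp_size = 768
align = (4 // 2) * dp_size                       # 1536-element alignment

padded_numel = math.ceil(numel / align) * align  # 1536 * 174 = 267264
padding = padded_numel - numel                   # 1024
partition_size = padded_numel // dp_size         # 348 elements per rank

# The stripping logic assumes each rank's partition is at least as large as
# the group padding; with DP=768 that assumption no longer holds.
print(padding, partition_size)                   # 1024 348
print(partition_size >= padding)                 # False -> the corner case hit here
```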


The correct way to resolve this is to fix the _strip_tensor_paddings() function to handle such corner cases.
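
One possible shape for such a fix (purely a sketch of the idea, with hypothetical helper names, not the actual DeepSpeed change): when the group padding is larger than a single rank's partition, the padding spans several trailing ranks, so the stripping logic has to work out how much of each rank's slice is padding instead of attributing all of it to the last rank.

```python
# Sketch only: how much of one rank's partition is padding when the group
# padding may span multiple trailing ranks. Names are hypothetical.
def rank_padding(group_padding: int, partition_size: int, dp_rank: int, dp_world_size: int) -> int:
    padded_numel = partition_size * dp_world_size
    pad_start = padded_numel - group_padding   # first padded element in the flat group
    rank_start = dp_rank * partition_size      # this rank owns [rank_start, rank_end)
    rank_end = rank_start + partition_size
    return max(0, rank_end - max(pad_start, rank_start))

# With the numbers above (padding=1024, partition=348, DP=768), the padding is
# spread over the last three ranks: 328 + 348 + 348 = 1024.
print([rank_padding(1024, 348, r, 768) for r in (764, 765, 766, 767)])  # [0, 328, 348, 348]
```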

@tjruwase @tohtana Please let me know if you have any suggestions or feedback on it.

@xylian86 (Contributor)

More specifically, the issue arises from the following lines in the ZeRO-1 engine:

```python
            if partition_id == dist.get_world_size(group=self.real_dp_process_group[i]) - 1:
                padding = self.bit16_groups_flat[i].numel() - orig_group_numel
            else:
                padding = 0
```

which assume that only the last rank holds the padding.
