fix: RuntimeError for UCP large DP #6918
base: master
Conversation
@loadams thanks for the formatting fix! Also just wanted to say no rush on this; I spoke briefly with @minjiazhang before the holidays about this issue and mentioned I would write up a more complete description of what I was seeing.

Maybe some minor thoughts: I think the change in `deepspeed/checkpoint/`:

```diff
-    def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index) -> dict:
+    def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index, strip_tensor_paddings: bool = True):
         return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
                                                        tp_index=tp_index,
                                                        dp_index=dp_index,
-                                                       keys_to_ignore=[PARAM_SHAPES])
+                                                       keys_to_ignore=[PARAM_SHAPES],
+                                                       strip_tensor_paddings=strip_tensor_paddings)
```

✅ is OK since this just passes the argument through.

However, I'm a bit less sure about this change:

```diff
 sd = ds_checkpoint.get_zero_checkpoint_state(pp_index=pp_index,
                                              tp_index=tp_index,
                                              dp_index=dp_index,
+                                             strip_tensor_paddings=False)
```

since I'm not completely clear on how the internals of this work. For our purposes, setting this to `False`, and thereby skipping the

```python
if strip_tensor_paddings:
    self._strip_tensor_paddings(sd)
```

block, resolves the issue we were seeing during conversion.
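As a toy illustration of what this flag ends up controlling (hypothetical and heavily simplified; `get_state_for_rank_toy` is not the real `get_state_for_rank()`): with `strip_tensor_paddings=False`, the trailing alignment padding stays in the flattened state that gets written into the universal checkpoint.

```python
import torch

def get_state_for_rank_toy(flat_state: torch.Tensor, padding: int,
                           strip_tensor_paddings: bool = True) -> torch.Tensor:
    """Toy stand-in for the padding-stripping step discussed above."""
    if strip_tensor_paddings and padding > 0:
        # Drop the trailing alignment padding before conversion.
        return flat_state[:flat_state.numel() - padding]
    # With the flag off, the padded buffer is returned unchanged.
    return flat_state

flat = torch.arange(10.0)  # pretend the last 3 elements are alignment padding
print(get_state_for_rank_toy(flat, padding=3).numel())                               # 7
print(get_state_for_rank_toy(flat, padding=3, strip_tensor_paddings=False).numel())  # 10
```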
@saforem2 Thank you for reporting the issues! Let me first explain why we need the `_strip_tensor_paddings()` function: in ZeRO, when optimizer states are partitioned across different ranks, padding is added to align the NCCL all-gather send buffers to 4-byte boundaries, so when DP=768 the alignment becomes 4 / 2 * 768 = 1536 (Code Reference). Universal Checkpointing follows a padding-free design principle, so when converting to Universal Checkpoints we must remove these paddings.

Therefore, skipping `_strip_tensor_paddings()` in this PR might avoid the immediate conversion error, but I suspect it could cause issues later when loading the universal checkpoint if we keep the padding, either a functional issue or a correctness issue.

Why does the conversion work for smaller DP sizes but fail when the DP size equals 768? Let me explain the cause. From the log you provided (thank you for providing such detailed logs!), I can see the model config (number of layers = 32, hidden size = 4096), which is a ~7B model, trained with ZeRO-1 and a DP size of 768. Param groups are created based on the weight-decay condition (regularized vs. non-regularized) and the learning-rate scale condition (args.lr vs. lr_mult * args.lr) (Code Reference).

For one group, the flattened tensor numel is 266240 and the alignment is 1536, so the required padding is 1536 * 174 - 266240 = 1024 and the per-rank partition size is 348 (1536 * 174 / 768). The current implementation assumes that the number of elements assigned to each rank will exceed the padding size. However, in this case 348 < 1024, violating that assumption. The correct way is to fix this issue in the `_strip_tensor_paddings()` function so it handles such corner cases.

@tjruwase @tohtana Please let me know if you have any suggestions or feedback on this.
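For reference, the arithmetic above can be reproduced with a few lines of Python (the numbers come from the comment; the ceil-to-multiple rounding is a standard formulation and not necessarily DeepSpeed's exact code):

```python
dp_size = 768
element_size = 2           # fp16/bf16 elements are 2 bytes
nccl_alignment_bytes = 4   # all-gather send buffers are aligned to 4 bytes

# Alignment in elements: 4 / 2 * 768 = 1536
alignment = (nccl_alignment_bytes // element_size) * dp_size

group_numel = 266240       # flattened numel of the affected param group

# Round up to a multiple of the alignment (ceil-to-multiple).
aligned_numel = ((group_numel + alignment - 1) // alignment) * alignment

padding = aligned_numel - group_numel        # 1536 * 174 - 266240 = 1024
partition_size = aligned_numel // dp_size    # 1536 * 174 / 768   = 348

assert (alignment, padding, partition_size) == (1536, 1024, 348)
# partition_size (348) < padding (1024): the padding does not fit inside
# the last rank's partition, which is the corner case described above.
```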
More specifically, the issue arises from the following lines in the ZeRO-1 engine, which assume that only the last rank will have the padding.
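To make the corner case concrete, here is a simplified, hypothetical sketch (plain Python/PyTorch over per-rank fragments; not the actual ZeRO-1 engine or `_strip_tensor_paddings()` code) contrasting stripping that only touches the last rank with stripping that walks back across ranks:

```python
import torch

def strip_padding_last_rank_only(fragments, padding):
    """Sketch of the assumption described above: all padding is expected
    to sit at the end of the LAST rank's fragment."""
    last = fragments[-1]
    # When padding exceeds the last fragment's size, this empties the last
    # fragment but leaves the remaining padding on earlier ranks.
    fragments[-1] = last[:max(last.numel() - padding, 0)]
    return fragments

def strip_padding_across_ranks(fragments, padding):
    """Corner-case-aware sketch: remove `padding` trailing elements even
    when the padding is larger than a single rank's partition and thus
    spans the last few ranks."""
    remaining = padding
    for i in range(len(fragments) - 1, -1, -1):
        if remaining == 0:
            break
        drop = min(remaining, fragments[i].numel())
        fragments[i] = fragments[i][:fragments[i].numel() - drop]
        remaining -= drop
    return fragments

# The DP=768 case from the discussion: 768 fragments of 348 elements each,
# with 1024 trailing padding elements spread over the last ~3 ranks.
frag_a = [torch.zeros(348) for _ in range(768)]
frag_b = [torch.zeros(348) for _ in range(768)]

print(sum(f.numel() for f in strip_padding_last_rank_only(frag_a, 1024)))  # 266916
print(sum(f.numel() for f in strip_padding_across_ranks(frag_b, 1024)))    # 266240
```

With DP=768 the 1024 padding elements span roughly the last three 348-element partitions, so trimming only the last rank leaves 676 padding elements behind, while the walk-back version recovers the original unpadded group numel of 266240.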
We encountered a strange bug when attempting to convert checkpoints (created with DP=768) to universal format.
An overview of the bug as well as a detailed description of the proposed fix is written up in:
argonne-lcf/Megatron-DeepSpeed/ALCF/notes/universal_checkpoint_bug.md