There seem to be some compatibility issues when using generate_text.sh to run a pretrained model checkpoint generated by the examples. I trained GPT2-125M-MoE64 with ds_pretrain_gpt_125M_MoE64.sh and got the checkpoint files as below:
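Sketching the layout rather than pasting the exact listing: DeepSpeed saves everything under a global_step<N> tag directory, with per-rank files whose names end in _model_states.pt (and _optim_states.pt for optimizer state), roughly like this (the iteration number and save path are just examples):

    <SAVE_PATH>/
        global_step10000/                  # tag directory for the saved iteration
            mp_rank_00_model_states.pt
            ...                            # expert / optimizer state files with similar suffixes
        latest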
However, when loading the checkpoint, the expected checkpoint path does not match this folder layout. In the get_checkpoint_name function at line 98 of megatron/checkpointing.py, the checkpoint path is defined as:
def get_checkpoint_name(checkpoints_path, iteration, release=False,
                        pipeline_parallel=None,
                        tensor_rank=None, pipeline_rank=None):
    """Determine the directory name for this rank's checkpoint."""
    if release:
        directory = 'release'
    else:
        directory = 'iter_{:07d}'.format(iteration)
    ...
    return os.path.join(common_path, "model_optim_rng.pt")
Bypassing the naming issue by changing the referenced checkpoint names (i.e., setting directory = 'global_step{:05d}'.format(iteration) and return f"{common_path}_model_states.pt"), generate_text.sh still leads me to the issues mentioned above.
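For reference, this is roughly what the function looks like after those two edits; it is only a sketch of the workaround described above, not a proper fix:

    def get_checkpoint_name(checkpoints_path, iteration, release=False,
                            pipeline_parallel=None,
                            tensor_rank=None, pipeline_rank=None):
        """Determine the file name for this rank's checkpoint (DeepSpeed layout)."""
        if release:
            directory = 'release'
        else:
            # DeepSpeed saves under global_step<N> instead of iter_<NNNNNNN>
            directory = 'global_step{:05d}'.format(iteration)
        ...
        # DeepSpeed names the per-rank file <prefix>_model_states.pt
        # instead of <rank_dir>/model_optim_rng.pt
        return f"{common_path}_model_states.pt"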
Is further model conversion needed to run the GPT-MoE models? Since the tutorials are not quite clear on this, could you let me know how to run the GPT-MoE models with DeepSpeed expert parallelism?