Model Inference issues with GPT-MoE models #458

Open
1155157110 opened this issue Jan 25, 2025 · 0 comments

Comments


1155157110 commented Jan 25, 2025

As mentioned in other issues, there seem to be some compatibility issues when using generate_text.sh to run the pretrained checkpoints produced by the examples. I trained GPT2-125M-MoE64 with ds_pretrain_gpt_125M_MoE64.sh and got the checkpoint files below:

checkpoint
└── gpt-0.125B-lr-4.5e-4-minlr-4.5e-06-bs-256-gpus-4-mp-1-pp-1-ep-64-mlc-0.01-cap-1.0-drop-true
    ├── global_step10000
    │   ├── expp_rank_0_mp_rank_00_optim_states.pt
    │   ├── expp_rank_1_mp_rank_00_optim_states.pt
    │   ├── expp_rank_2_mp_rank_00_optim_states.pt
    │   ├── expp_rank_3_mp_rank_00_optim_states.pt
    │   ├── layer_0_expert_0_mp_rank_00_model_states.pt
    │   ├── layer_0_expert_1_mp_rank_00_model_states.pt
    │   ├── ...
    │   ├── layer_5_expert_63_mp_rank_00_model_states.pt
    │   └── mp_rank_00_model_states.pt
    ├── latest
    └── latest_checkpointed_iteration.txt
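
For anyone reproducing this, the individual files can be inspected to see how the weights are split between expert and non-expert state (a quick sketch using plain torch.load; paths are relative to the global_step10000 directory above):

import torch

# Non-expert (dense) parameters for model-parallel rank 0
dense = torch.load('mp_rank_00_model_states.pt', map_location='cpu')
print(list(dense.keys()))

# Parameters of expert 0 in the first MoE layer
expert = torch.load('layer_0_expert_0_mp_rank_00_model_states.pt',
                    map_location='cpu')
print(list(expert.keys()))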

However, when loading the checkpoint, the path that the loader constructs does not match this folder layout. The get_checkpoint_name function in megatron/checkpointing.py (line 98) defines the checkpoint path as:

def get_checkpoint_name(checkpoints_path, iteration, release=False,
                        pipeline_parallel=None,
                        tensor_rank=None, pipeline_rank=None):
    """Determine the directory name for this rank's checkpoint."""
    if release:
        directory = 'release'
    else:
        directory = 'iter_{:07d}'.format(iteration)
    ...
    return os.path.join(common_path, "model_optim_rng.pt")

Bypassing the naming mismatch by changing the expected checkpoint names (i.e., setting directory = 'global_step{:05d}'.format(iteration) and returning f"{common_path}_model_states.pt") lets the checkpoint be found, but generate_text.sh then runs into the issues mentioned above.
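For reference, the workaround amounts to the following (a minimal sketch of only the path construction; get_moe_checkpoint_name is a hypothetical name, it hardcodes model-parallel rank 00, and it skips the tensor/pipeline-parallel rank handling that the real function contains):

import os

def get_moe_checkpoint_name(checkpoints_path, iteration, release=False):
    """Build the non-expert checkpoint path using the DeepSpeed MoE layout."""
    if release:
        directory = 'release'
    else:
        # DeepSpeed writes 'global_step10000', not Megatron's 'iter_0010000'
        directory = 'global_step{:05d}'.format(iteration)
    # The non-expert model state is named mp_rank_00_model_states.pt;
    # expert weights live in the separate layer_*_expert_* files.
    common_path = os.path.join(checkpoints_path, directory, 'mp_rank_00')
    return f"{common_path}_model_states.pt"

# e.g. get_moe_checkpoint_name('checkpoint/gpt-0.125B-...', 10000)
# -> 'checkpoint/gpt-0.125B-.../global_step10000/mp_rank_00_model_states.pt'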

Is further model conversion needed to run the GPT-MoE models? Since the tutorials are not quite clear on this, could you explain how to run GPT-MoE models with DeepSpeed expert parallelism?
