Model Inference issues with GPT-MoE models #458

Open
1155157110 opened this issue Jan 25, 2025 · 0 comments

Comments


1155157110 commented Jan 25, 2025

As mentioned in other issues, there seem to be some compatibility issues when using generate_text.sh to run the pretrained checkpoints produced by the examples. I trained GPT2-125M-MoE64 with ds_pretrain_gpt_125M_MoE64.sh and got the checkpoint files below:

checkpoint
└── gpt-0.125B-lr-4.5e-4-minlr-4.5e-06-bs-256-gpus-4-mp-1-pp-1-ep-64-mlc-0.01-cap-1.0-drop-true
    ├── global_step10000
    │   ├── expp_rank_0_mp_rank_00_optim_states.pt
    │   ├── expp_rank_1_mp_rank_00_optim_states.pt
    │   ├── expp_rank_2_mp_rank_00_optim_states.pt
    │   ├── expp_rank_3_mp_rank_00_optim_states.pt
    │   ├── layer_0_expert_0_mp_rank_00_model_states.pt
    │   ├── layer_0_expert_1_mp_rank_00_model_states.pt
    │   ├── ...
    │   ├── layer_5_expert_63_mp_rank_00_model_states.pt
    │   └── mp_rank_00_model_states.pt
    ├── latest
    └── latest_checkpointed_iteration.txt
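
For anyone reproducing this, the individual files can be inspected to see how the weights are split between expert and non-expert state (a quick sketch using plain torch.load; paths are relative to the global_step10000 directory above):

import torch

# Non-expert (dense) parameters for model-parallel rank 0
dense = torch.load('mp_rank_00_model_states.pt', map_location='cpu')
print(list(dense.keys()))

# Parameters of expert 0 in the first MoE layer
expert = torch.load('layer_0_expert_0_mp_rank_00_model_states.pt',
                    map_location='cpu')
print(list(expert.keys()))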

However, when loading the checkpoint, the path that the loader constructs does not match this folder layout. The get_checkpoint_name function in megatron/checkpointing.py (line 98) defines the checkpoint path as:

def get_checkpoint_name(checkpoints_path, iteration, release=False,
                        pipeline_parallel=None,
                        tensor_rank=None, pipeline_rank=None):
    """Determine the directory name for this rank's checkpoint."""
    if release:
        directory = 'release'
    else:
        directory = 'iter_{:07d}'.format(iteration)
    ...
    return os.path.join(common_path, "model_optim_rng.pt")

Bypassing the naming mismatch by changing the expected checkpoint names (i.e., setting directory = 'global_step{:05d}'.format(iteration) and returning f"{common_path}_model_states.pt") lets the checkpoint be found, but generate_text.sh then runs into the issues mentioned above.
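For reference, the workaround amounts to the following (a minimal sketch of only the path construction; get_moe_checkpoint_name is a hypothetical name, it hardcodes model-parallel rank 00, and it skips the tensor/pipeline-parallel rank handling that the real function contains):

import os

def get_moe_checkpoint_name(checkpoints_path, iteration, release=False):
    """Build the non-expert checkpoint path using the DeepSpeed MoE layout."""
    if release:
        directory = 'release'
    else:
        # DeepSpeed writes 'global_step10000', not Megatron's 'iter_0010000'
        directory = 'global_step{:05d}'.format(iteration)
    # The non-expert model state is named mp_rank_00_model_states.pt;
    # expert weights live in the separate layer_*_expert_* files.
    common_path = os.path.join(checkpoints_path, directory, 'mp_rank_00')
    return f"{common_path}_model_states.pt"

# e.g. get_moe_checkpoint_name('checkpoint/gpt-0.125B-...', 10000)
# -> 'checkpoint/gpt-0.125B-.../global_step10000/mp_rank_00_model_states.pt'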

Is further model conversion needed to run the GPT-MoE models? Since the tutorials are not quite clear on this, could you explain how to run GPT-MoE models with DeepSpeed expert parallelism?
