[MoE][PyTorch] Add mask-based MoE permutation #1373

hxbai · 2024-12-13T04:49:02Z

Description

Add mask-based token permutation and local chunk permutation fused kernels. These kernels are implemented with OpenAI Triton.

Related commit in Megatron-LM: NVIDIA/Megatron-LM@ac0474d

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactor

Changes

Please list the changes introduced in this PR:

Non-breaking API changes to te.pytorch.permutation.moe_permute and te.pytorch.permutation.moe_unpermute
Add a new API, te.pytorch.permutation.moe_sort_chunks_by_indices
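
For readers unfamiliar with the two operations, here is a minimal pure-Python sketch of what mask-based permutation and chunk sorting compute. It is not the Triton implementation and does not reproduce the signatures of the TE APIs; the helper names `permute_by_mask` and `sort_chunks`, and the `[num_tokens][num_experts]` mask layout, are assumptions made for illustration.

```python
def permute_by_mask(tokens, routing_mask):
    """Group token rows by expert.

    routing_mask[t][e] is 1 when token t is routed to expert e (assumed layout).
    Returns the permuted rows and, for each permuted row, its source token index.
    """
    num_experts = len(routing_mask[0])
    permuted, source_rows = [], []
    for expert in range(num_experts):
        for token_idx, row in enumerate(routing_mask):
            if row[expert]:
                permuted.append(tokens[token_idx])
                source_rows.append(token_idx)
    return permuted, source_rows


def sort_chunks(rows, split_sizes, sorted_indices):
    """Split rows into contiguous chunks and concatenate them in a new order."""
    chunks, start = [], 0
    for size in split_sizes:
        chunks.append(rows[start:start + size])
        start += size
    return [row for i in sorted_indices for row in chunks[i]]


tokens = ["a", "b", "c"]
routing_mask = [[1, 0], [1, 1], [0, 1]]  # token 1 goes to both experts
permuted, source_rows = permute_by_mask(tokens, routing_mask)
reordered = sort_chunks(permuted, [2, 2], [1, 0])  # swap the two expert chunks
```

A token routed to several experts appears once per expert in the permuted output, which is why the output can have more rows than the input.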

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

timmoon10 · 2025-01-08T19:03:56Z

transformer_engine/pytorch/permutation.py

 ]


-class _moe_permute(torch.autograd.Function):
-    """functional Permute"""
+class _moe_permute_indice_map(torch.autograd.Function):


Suggested change

-class _moe_permute_indice_map(torch.autograd.Function):
+class _moe_permute_index_map(torch.autograd.Function):

We should make sure to use "index" in user-facing APIs like moe_permute/moe_unpermute.

timmoon10 · 2025-01-08T21:16:00Z

transformer_engine/pytorch/permutation.py

 import warnings
 from typing import Tuple
 import torch

 import transformer_engine_torch as tex
-from .constants import TE_DType
-from .float8_tensor import Float8Tensor
+import transformer_engine.pytorch.triton.permutation as triton_permuataion


Nit:

Suggested change

-import transformer_engine.pytorch.triton.permutation as triton_permuataion
+import transformer_engine.pytorch.triton.permutation as triton_permutation

timmoon10 · 2025-01-08T21:23:41Z

transformer_engine/pytorch/permutation.py

+            if ctx.fp8:
+                assert isinstance(
+                    permuted_act_grad, Float8Tensor
+                ), "Grad of the output must be in Float8Tensor type for FP8 moe_permute."


Couldn't we decouple FP8 in the forward and backward?

Suggested change

-            if ctx.fp8:
-                assert isinstance(
-                    permuted_act_grad, Float8Tensor
-                ), "Grad of the output must be in Float8Tensor type for FP8 moe_permute."
+            fp8 = isinstance(permuted_act_grad, Float8Tensor)
+            if fp8:

If there are no obstacles, we could also do the same thing for _moe_unpermute_mask_map and _moe_chunk_sort.

timmoon10 · 2025-01-08T21:50:21Z

tests/pytorch/test_permutation.py

+    # Results Check
+    #
+    ###################################################################################################################################
+    tols = dtype_tols(te_dtype)


Shouldn't we expect bit-wise exact results?

Suggested change

-    tols = dtype_tols(te_dtype)
+    tols = { "atol": 0, "rtol": 0 }
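
The reasoning behind zero tolerances is that a pure permutation only copies values and performs no arithmetic, so every bit pattern should survive a permute/unpermute round trip. A small stdlib-only sketch of that idea follows; the `bits` helper and the sample values are hypothetical, and this is not the test code from the PR.

```python
import struct


def bits(x):
    """Raw IEEE-754 bytes of a Python float, for bit-wise comparison."""
    return struct.pack("<d", x)


values = [0.1, -0.0, 1e-300, 3.5]
perm = [2, 0, 3, 1]  # permuted[dst] = values[perm[dst]]

permuted = [values[src] for src in perm]

# Undo the permutation by scattering each row back to its source slot.
restored = [None] * len(values)
for dst, src in enumerate(perm):
    restored[src] = permuted[dst]

round_trip_is_bit_exact = [bits(v) for v in restored] == [bits(v) for v in values]
```

Comparing raw bytes is stricter than `==` on floats: `-0.0 == 0.0` is true even though the bit patterns differ.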

timmoon10 · 2025-01-08T21:54:30Z

tests/pytorch/test_permutation.py

+    # Results Check
+    #
+    ###################################################################################################################################
+    tols = dtype_tols(te_dtype)


We should expect bit-wise exact results.

Suggested change

-    tols = dtype_tols(te_dtype)
+    tols = { "atol": 0, "rtol": 0 }

phu0ngng · 2025-01-10T17:21:50Z

transformer_engine/pytorch/triton/permutation.py

+        mask=(offset < num_tokens),
+        other=0,
+    ).to(tl.int64)
+    expert_token_cumsum = tl.cumsum(expert_token_mask) * expert_token_mask


An interesting way to exclude the zero token_mask. Happy to learn!
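
A plain-Python illustration of the trick in that line, assuming a 0/1 mask: the running sum gives each selected token a 1-based rank, and multiplying by the mask resets unselected positions to 0.

```python
from itertools import accumulate

expert_token_mask = [0, 1, 1, 0, 1]

# Running count of selected tokens up to and including each position.
running = list(accumulate(expert_token_mask))

# Multiplying by the mask zeroes the positions that were not selected.
expert_token_cumsum = [c * m for c, m in zip(running, expert_token_mask)]
```

Because the ranks are 1-based, 0 is free to serve as the "not routed" marker.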

phu0ngng · 2025-01-10T17:45:28Z

transformer_engine/pytorch/triton/permutation.py

+    chunk_cumsum = tl.load(
+        row_id_map_ptr + pid_m * num_tokens + offset, mask=(offset < num_tokens), other=0
+    )
+
+    workspace_off = tl.arange(0, WORKSPACE_LOAD_WIDTH)
+    chunk_sums = tl.load(workspace_ptr + workspace_off, mask=workspace_off < chunk_idx)
+    chunk_cumsum = tl.where(chunk_cumsum == 0, -1, chunk_cumsum + tl.sum(chunk_sums) - 1)


The three names here, chunk_cumsum, chunk_sums, and the reassigned chunk_cumsum, are quite confusing.
If I understand correctly, I suggest renaming them:

chunk_cumsum (as loaded from row_id_map_ptr) -> row_id_within_token_block

chunk_sums -> n_tokens_per_expert

chunk_cumsum (after tl.where) -> row_id

In addition, I think we should move the -1 into pass 1, since it is a correction to the calculation of expert_token_cumsum:

expert_token_cumsum = (tl.cumsum(expert_token_mask) - 1) * expert_token_mask
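
To make the two passes easier to follow, here is a rough pure-Python model of the computation as it can be read from the quoted snippet. The function names `pass1` and `pass2`, the `block` parameter, and the expert-major chunk indexing are assumptions; the real kernels run these steps in parallel in Triton.

```python
from itertools import accumulate


def pass1(mask_rows, block):
    """Per (expert, block): 1-based cumsum inside the block, zeroed where mask is 0.

    mask_rows[e][t] is 1 when token t is routed to expert e (assumed layout).
    Also records how many tokens each chunk selected (the "workspace").
    """
    local, workspace = [], []
    for row in mask_rows:
        local_row = []
        for start in range(0, len(row), block):
            chunk = row[start:start + block]
            csum = list(accumulate(chunk))
            local_row.extend(c * m for c, m in zip(csum, chunk))
            workspace.append(sum(chunk))
        local.append(local_row)
    return local, workspace


def pass2(local, workspace, block):
    """Add the token counts of all earlier chunks to get global, 0-based row ids."""
    num_blocks = -(-len(local[0]) // block)  # ceiling division
    row_id_map = []
    for expert, local_row in enumerate(local):
        out = []
        for t, v in enumerate(local_row):
            chunk_idx = expert * num_blocks + t // block
            prior = sum(workspace[:chunk_idx])
            out.append(-1 if v == 0 else v + prior - 1)  # -1 marks "not routed"
        row_id_map.append(out)
    return row_id_map


mask_rows = [[1, 0, 1, 1, 0], [0, 1, 0, 0, 1]]
local, workspace = pass1(mask_rows, 2)
row_id_map = pass2(local, workspace, 2)
```

In this sketch the `v == 0` test in pass 2 relies on the pass-1 values being 1-based. If the -1 were folded into pass 1 as `(cumsum - 1) * mask`, the first selected token of a chunk would also be 0, so pass 2 would need a different way to detect unrouted tokens.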

hxbai changed the title ~~[MoE][Common/PyTorch] Add mask-based MoE permutation~~ [MoE][PyTorch] Add mask-based MoE permutation Dec 13, 2024

yaox12 reviewed Dec 13, 2024

transformer_engine/pytorch/permutation.py (resolved)

hxbai and others added 4 commits December 13, 2024 06:05

add mask based moe permutation

7e04f9a

Signed-off-by: Hongxiao Bai <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

664af70

for more information, see https://pre-commit.ci

Signed-off-by: Hongxiao Bai <[email protected]>

change moe_chunk_permute to moe_sort_chunks_by_indices

a8f1daa

Signed-off-by: Hongxiao Bai <[email protected]>

fix __all__ in pytorch/permutation.py

ca94d72

Signed-off-by: Hongxiao Bai <[email protected]>

hxbai force-pushed the permute_fusion branch from 6160104 to ca94d72 Compare December 13, 2024 06:05

phu0ngng self-requested a review January 8, 2025 15:20

timmoon10 reviewed Jan 8, 2025

timmoon10 self-requested a review January 8, 2025 21:57

phu0ngng reviewed Jan 10, 2025
