Use a single wgmma wait_group to flush async wgmma pipeline #3843
…lar buffering for-loop
* Add requires_commit arg to getSyncExprs
* No wgmma commit required for ReadAfterWriteSyncs
* Create flush_async_mma_pipeline
!test
Review updated until commit fdd0f98
@@ -759,6 +769,9 @@ class ReadAfterWriteSyncs : public kir::ExprMutator {
   }

  private:
   //! Only a single wgmma wait_group to flush async mma pipeline.
   bool flush_async_mma_pipeline = false;
Is the reason for generating multiple commits and waits that there are multiple uses of mma results, and one commit and wait is generated for each use? If so, would it make more sense to promote the local variable input_async_ops to a class member async_ops_to_sync_? We would add an async op and its type to it when we see a new async op of that type. When we see an expr whose input's definition is of that async type, we would insert a commit-wait and remove the async type (together with all ops of that type).
The approach proposed here is basically saying: "If I have ever inserted an mma wait, then never insert one again in the fusion". I don't think this is safe. For example, if we have a kernel:
D1 = mma(A1, B1);
output1 = relu(D1);
D2 = mma(A2, B2);
output2 = relu(D2);
then we do need a wait before output2 = relu(D2).
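The bookkeeping suggested in this comment can be sketched roughly as follows. This is a minimal illustration, not nvFuser code: `Expr`, `AsyncOpType`, `AsyncSyncTracker`, and the string labels for the emitted syncs are hypothetical stand-ins, and only the member name `async_ops_to_sync_` comes from the comment itself. It assumes each expr defines a single named output.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for the kernel IR; not the real nvFuser types.
enum class AsyncOpType { NotAsync, WgMma, CpAsyncBulk };

struct Expr {
  std::string output;               // name of the value this expr defines
  std::vector<std::string> inputs;  // names of the values this expr reads
  AsyncOpType async_type;           // NotAsync for ordinary exprs
};

class AsyncSyncTracker {
 public:
  // Returns labels for the syncs that must be inserted before `expr`.
  std::vector<std::string> visit(const Expr& expr) {
    assert(!expr.output.empty());
    // Find every async type that has a pending result read by this expr.
    std::set<AsyncOpType> to_flush;
    for (const auto& in : expr.inputs) {
      for (const auto& entry : async_ops_to_sync_) {
        if (entry.second.count(in) > 0) {
          to_flush.insert(entry.first);
        }
      }
    }
    // Insert one commit-wait per type and forget all ops of that type.
    std::vector<std::string> syncs;
    for (AsyncOpType type : to_flush) {
      syncs.push_back(
          type == AsyncOpType::WgMma ? "wgmma commit+wait"
                                     : "bulk commit+wait");
      async_ops_to_sync_.erase(type);
    }
    // Record a newly issued async op as pending.
    if (expr.async_type != AsyncOpType::NotAsync) {
      async_ops_to_sync_[expr.async_type].insert(expr.output);
    }
    return syncs;
  }

 private:
  // Pending async results, grouped by async type.
  std::map<AsyncOpType, std::set<std::string>> async_ops_to_sync_;
};
```

Visiting the four exprs of the kernel above in order would return a sync before each relu and nothing before either mma, which is the behavior the comment asks for.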
Is the reason for generating multiple commits and waits that there are multiple uses of mma results, and one commit and wait is generated for each use?
Yes.
At the moment, all RAW syncs for wgmma occur outside the K loop, so no new wgmma ops are issued at that point. It is important not to issue wgmma.commit_group there. Essentially, the RAW sync flushes the async mma pipeline.
TL;DR: I pushed a commit to set flush_async_mma_pipeline := false when encountering an mma op, so we can issue more than one RAW sync.
Changes
- fill_async_mma_pipeline is true when any mma expression is issued. A RAW sync is required before any consumer operation uses the results of the mma expression.
- flush_async_mma_pipeline is true when a RAW sync is issued for the async mma pipeline. At the moment, the RAW sync for async wgmma is wgmma.wait_group(0). All prior mma operations are complete after this operation.
- fill_async_mma_pipeline is always false at the end of ReadAfterWriteSyncs.
Two independent mma expressions; No shared circular buffered main loop
# <<<--- Step 0: fill_async_mma_pipeline == false && flush_async_mma_pipeline == false
D1 = mma(A1, B1)
# <<<--- Step 1: fill_async_mma_pipeline := true && flush_async_mma_pipeline == false
output1 = relu(D1)
# <<<--- Step 2: fill_async_mma_pipeline := false && flush_async_mma_pipeline == true
# (Add wgmma.wait_group(0) before relu)
D2 = mma(A2, B2)
# <<<--- Step 3: fill_async_mma_pipeline := true && flush_async_mma_pipeline == false
output2 = relu(D2)
# <<<--- Step 4: fill_async_mma_pipeline := false && flush_async_mma_pipeline == true
# (Add wgmma.wait_group(0) before relu)
Horizontal fused mma expressions; Shared circular buffered main loop
- Given that both mma operations share the circular buffer main loop, they are grouped together with the same wgmma.commit_group.
# <<<--- Step 0: fill_async_mma_pipeline == false && flush_async_mma_pipeline == false
D1 = mma(A1, B1)
D2 = mma(A1, B2)
# <<<--- Step 1: fill_async_mma_pipeline := true && flush_async_mma_pipeline == false
output1 = relu(D1)
output2 = relu(D2)
# <<<--- Step 2: fill_async_mma_pipeline := false && flush_async_mma_pipeline == true
# (Add wgmma.wait_group(0) before relu)
Single mma with epilogue
- Given that both mma operations share circular buffer main loop, they are grouped together with the same wgmma.commit_group.
# <<<--- Step 0: fill_async_mma_pipeline == false && flush_async_mma_pipeline == false
D1 = mma(A1, B1)
# <<<--- Step 1: fill_async_mma_pipeline := true && flush_async_mma_pipeline == false
b = D1 + bias
# <<<--- Step 2: fill_async_mma_pipeline := false && flush_async_mma_pipeline == true
# (Add wgmma.wait_group(0) before add)
a = relu(b)
# <<<--- Step 3: fill_async_mma_pipeline := false && flush_async_mma_pipeline == true
# (Do nothing; RAW sync not required)
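The three traces above can be reproduced with a small simulation of the two flags. This is a sketch under assumptions, not the actual ReadAfterWriteSyncs pass: `Op`, `placeWaits`, and the `pending` list are hypothetical, and the rule that a wait is placed only when a consumer reads a not-yet-flushed mma result is inferred from the step annotations.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified expression: one output, named inputs.
struct Op {
  std::string output;
  std::vector<std::string> inputs;
  bool is_mma;
};

// Returns the indices of ops that need wgmma.wait_group(0) placed before them.
std::vector<int> placeWaits(const std::vector<Op>& ops) {
  bool fill_async_mma_pipeline = false;   // an mma was issued, not yet waited on
  bool flush_async_mma_pipeline = false;  // a RAW sync was issued since last mma
  std::vector<std::string> pending;       // mma results not yet waited on
  std::vector<int> waits;
  for (int i = 0; i < static_cast<int>(ops.size()); ++i) {
    const Op& op = ops[i];
    // Does this op consume an mma result that is still in flight?
    bool reads_pending = false;
    for (const auto& in : op.inputs) {
      if (std::find(pending.begin(), pending.end(), in) != pending.end()) {
        reads_pending = true;
      }
    }
    if (fill_async_mma_pipeline && reads_pending) {
      // wait_group(0) completes every prior mma, so nothing stays pending.
      waits.push_back(i);
      pending.clear();
      fill_async_mma_pipeline = false;
      flush_async_mma_pipeline = true;
    }
    if (op.is_mma) {
      // A new mma refills the pipeline and re-arms the RAW sync.
      pending.push_back(op.output);
      fill_async_mma_pipeline = true;
      flush_async_mma_pipeline = false;
    }
    // In the traces above the two flags are never true at the same time.
    assert(!(fill_async_mma_pipeline && flush_async_mma_pipeline));
  }
  return waits;
}
```

Under these assumptions, the two independent mma expressions get a wait before each relu, the horizontally fused pair gets a single wait before the first relu, and the single mma with epilogue gets a single wait before the add.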
Your proposal is more flexible than what is implemented now because it can be used for wgmma and tma store.
There are failure cases in the sync logic for both wgmma and tma store. I'd prefer to fix wgmma separately now. Then, try this proposal when fixing tma store failures.
Insert a commit-wait and remove the async type (together with all ops with that type) if seeing an expr whose input's definition is that async type.
This would flush an async type as soon as an expr reads an input defined by that async type.
Would this be sub-optimal for the tma store async group?
If the RAW sync for the tma store async group is always outside the circular buffer loop, then maybe it is fine.
bd59867 to e48b06b
!test
This PR optimizes RAW sync insertion for wgmma operations.
Problem:
- wgmma.commit_group and wgmma.wait_group together can cause the compiler to serialize wgmma.mma_async.
- wgmma.commit_group is required when issuing a wgmma operation group.
- wgmma.wait_group is required when waiting for wgmma operation groups to finish.
Proposed Solution:
- Skip wgmma.commit_group after completing mma operations but before any consumer operations.
- Add bool requires_commit to lower_utils::getSyncExprs, so the commit phase is optional.
Why?
- wgmma.wait_group 0 flushes the entire pipeline, so all wgmma operations are complete. Additional RAW syncs are unnecessary.
Cuda Examples
From MLPBenchmarkTest.FwdHorizontalFusion/data_parallel_warpspec
Without Fix: 4 RAW wgmma syncs
With Fix: 1 RAW wgmma sync
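The optional commit phase from the proposed solution can be illustrated with a small sketch. This is not the real lower_utils::getSyncExprs: `wgmmaSyncSketch` is a hypothetical helper that returns abbreviated instruction names as strings, and only the `requires_commit` parameter name is taken from the PR description.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: lists the sync instructions for a wgmma RAW sync.
// keep_stages is the number of wgmma groups allowed to remain in flight.
std::vector<std::string> wgmmaSyncSketch(
    int64_t keep_stages,
    bool requires_commit) {
  assert(keep_stages >= 0);
  std::vector<std::string> exprs;
  if (requires_commit) {
    // Only needed when new wgmma ops were issued and must form a group.
    exprs.push_back("wgmma.commit_group");
  }
  // With keep_stages == 0 the wait flushes the entire async mma pipeline.
  exprs.push_back("wgmma.wait_group " + std::to_string(keep_stages));
  return exprs;
}
```

With `requires_commit` set to false, the sketch returns only the wait, which corresponds to the RAW sync placed after the main loop where no new wgmma ops are issued.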