[LLPC] Scalarize non-uniform loads inside the waterfall loop #2759
Conversation
Can you add a description, please? Is this PR identical to #2645, or have you fixed something?
Yes, this patch is an update of #2645.
Test summary for commit 7033caf: CTS tests (Failed: 31/139021)
Ubuntu navi3x, Srdcvk; Ubuntu navi2x, Srdcvk
test this please
Technically this is a revert of #2705 as well? At a high level, the GFX 10.3.2 codegen just modifies the image descriptor after it is loaded, as a workaround.
As we discussed, this was my intention. Any suggestions are welcome :)
Test summary for commit 7033caf: CTS tests (Failed: 31/139021)
Ubuntu navi3x, Srdcvk; Ubuntu navi2x, Srdcvk
The use of matching feels sort of backwards. If we know the code sequences that should go into the waterfall loops, shouldn't we create them like that directly? I'll try to see if I can expand on this, but I think the gist of it is that we should turn the waterfall loop creation upside-down: instead of creating the image instruction and then calling createWaterfallLoop, we should call createWaterfallLoop first, and then insert instructions into the loop.
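Roughly, the inversion would look like this (a sketch of the idea only; apart from createWaterfallLoop, which the comment above names, the helpers are hypothetical and not existing LGC API):

// Today (simplified): the image op is created first and wrapped afterwards,
// which forces the waterfall-loop code to pattern-match the descriptor loads.
Instruction *imageOp = createImageOp(descLoads, args);              // hypothetical helper
Value *result = createWaterfallLoop(imageOp, nonUniformOperandIdxs);

// Suggested: open the loop first, then create the descriptor loads and the
// image op directly inside its body, so nothing needs to be matched later.
WaterfallLoop loop = beginWaterfallLoop(nonUniformIndices);          // hypothetical helper
Value *desc = createDescriptorLoad(loop.uniformIndex());             // hypothetical helper
Value *result2 = endWaterfallLoop(loop, createImageOp(desc, args));  // hypothetical helper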
lgc/builder/BuilderImpl.cpp
Outdated
Value *OrigLoad = nullptr;
Value *OrigAnd = nullptr;
Value *OrigGep = nullptr;
Value *OrigExtract = nullptr;
Coding style in LLPC is to use lowerCamelCase.
Done :)
I remember now what the difficulty was with just creating waterfall loops directly. The descriptor load instructions are created separately from the actual image access instructions. I wonder whether we shouldn't change this.
I suppose that with some fixes this PR can at least be made correct.
llvm::Value *emitWaterfallBegin(llvm::Instruction *nonUniformInst, llvm::ArrayRef<unsigned> operandIdxs,
                                llvm::ArrayRef<llvm::Value *> nonUniformIndices, bool useVgprForOperands = false,
                                const llvm::Twine &instName = "");

llvm::Value *emitWaterfallBeginForScalarizedLoops(llvm::Instruction *nonUniformInst,
These two methods should be private.
Done :)
lgc/builder/BuilderImpl.cpp
Outdated
if (nonUniformIndices.empty())
  return nonUniformInst;

if (nonUniformInst->getType()->isVoidTy())
  scalarizeDescriptorLoads = false;
Why this?
Sorry, I missed this. This is for the case where we have the following intrinsic: @llvm.amdgcn.struct.buffer.store.format.v4f32()
Can you please comment on that?
Done
This still doesn't make sense though. The "scalarization" is a function of the descriptor inputs to the intrinsic, not a function of its output. This should be removed.
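A minimal sketch of that rule (isOperandNonUniform is an assumed helper, not code from this PR): the decision is driven purely by the descriptor operands, so a void-returning intrinsic such as @llvm.amdgcn.struct.buffer.store.format.v4f32 is still a scalarization candidate.

bool shouldScalarizeDescriptorLoads(llvm::Instruction *nonUniformInst,
                                    llvm::ArrayRef<unsigned> operandIdxs) {
  for (unsigned idx : operandIdxs)
    if (isOperandNonUniform(nonUniformInst->getOperand(idx))) // assumed helper
      return true; // holds even when nonUniformInst->getType()->isVoidTy()
  return false;
}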
I removed it and added two tests: scalarizationOfDescriptorLoadTest1.ll and scalarizationOfDescriptorLoadTest2.ll.
lgc/builder/BuilderImpl.cpp
Outdated
newLoad->insertBefore(nonUniformInst);
nonUniformInst->setOperand(operandIdx, newLoad);
newGep->insertBefore(newLoad);
newGep->setOperand(1, readFirstLane);
Need to check that operand 1 is actually the firstIndexInst.
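The check could be as small as a guard before the rewrite, for example (sketch only, using the names from the quoted diff and the existing bail-out flag):

if (newGep->getOperand(1) != firstIndexInst) {
  // Operand 1 is not the index we traced; skip scalarizing this access.
  scalarizeDescriptorLoads = false;
}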
I think we should not do it in the new version of the code.
lgc/builder/BuilderImpl.cpp
Outdated
auto *loadOp = dyn_cast<LoadInst>(OrigLoad);
auto *gepOp = dyn_cast<GetElementPtrInst>(OrigGep);
auto *AndOp = dyn_cast<Instruction>(OrigAnd);
auto *extractOp = dyn_cast<ExtractElementInst>(OrigExtract);
These should all just be cast. No reason to use dyn_cast after previously doing all the match checks.
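To illustrate the point (a minimal sketch, not the PR's actual code): after a successful PatternMatch check, cast<> is appropriate because it asserts on failure, whereas dyn_cast<> would wrongly suggest the cast can still fail at that point.

#include "llvm/IR/Instructions.h"
#include "llvm/IR/PatternMatch.h"

static llvm::LoadInst *getMatchedLoad(llvm::Value *origLoad) {
  using namespace llvm;
  using namespace llvm::PatternMatch;
  Value *ptr = nullptr;
  if (!match(origLoad, m_Load(m_Value(ptr))))
    return nullptr;                 // failure is only possible at the match
  return cast<LoadInst>(origLoad);  // guaranteed to succeed after the match
}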
Thank you for pointing it out :) It is not needed any more.
Yet another follow-up to this... wouldn't it be possible to collect the relevant instructions in traceNonUniformIndex?
I agree with capturing the visited instruction chain from traceNonUniformIndex. Longer term, I have been suggesting that we move this lowering to a separate pass later in the compiler and simply annotate the output of non-uniform instructions as ...
@nhaehnle : Thank you for the suggestion. I will try to capture the instruction chain from traceNonUniformIndex.
lgc/builder/BuilderImpl.cpp
Outdated
@@ -35,9 +35,11 @@
#include "lgc/state/TargetInfo.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/IntrinsicsAMDGPU.h"
#include "llvm/IR/PatternMatch.h"
Can you use the namespace directly inside the method emitWaterfallBeginForScalarizedLoops?
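In other words, the suggestion is to keep the using-directive local to the one function that needs PatternMatch. A minimal sketch (the helper itself is made up for illustration; as noted below, the pattern-match code was later dropped from the PR anyway):

#include "llvm/IR/PatternMatch.h"

// Scope the PatternMatch names to the single function that uses them
// instead of importing them for the whole file.
static bool isAndOfTwoValues(llvm::Value *v) {
  using namespace llvm::PatternMatch; // visible only inside this function
  llvm::Value *lhs = nullptr, *rhs = nullptr;
  return match(v, m_And(m_Value(lhs), m_Value(rhs)));
}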
Force-pushed from 7033caf to 23cd027.
@tsymalla-AMD : I removed all the pattern match code :)
Test summary for commit 23cd027: CTS tests (Failed: 32/138994)
Ubuntu navi3x, Srdcvk; Ubuntu navi2x, Srdcvk
Looks like some CTS tests are failing, but I did not check whether they are related.
Force-pushed from 23cd027 to d180bcf.
Test summary for commit d180bcf: CTS tests (Failed: 32/138994)
Ubuntu navi3x, Srdcvk; Ubuntu navi2x, Srdcvk
Test summary for commit 43a7a13: CTS tests (Failed: 0/138443)
Ubuntu navi3x, Srdcvk; Ubuntu navi2x, Srdcvk
lgc/builder/BuilderImpl.cpp
Outdated
// Initialization of instrToIndex and indexToInstr.
if (scalarizeDescriptorLoads) {
  unsigned cnt = 0;
  for (Instruction *I = nonUniformInst->getPrevNode(); I != nullptr && cnt < 64; I = I->getPrevNode(), ++cnt) {
Is the maximum number of steps something we should parametrize (or at least provide as an easily parametrizable value)?
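One way to expose the limit, for illustration (the constant and function names are assumptions; the default of 64 matches the quoted code):

static constexpr unsigned DefaultMaxBackwardScanSteps = 64;

void initInstrMaps(llvm::Instruction *nonUniformInst,
                   unsigned maxSteps = DefaultMaxBackwardScanSteps) {
  unsigned cnt = 0;
  for (llvm::Instruction *i = nonUniformInst->getPrevNode();
       i != nullptr && cnt < maxSteps; i = i->getPrevNode(), ++cnt) {
    // ... populate instrToIndex and indexToInstr with i ...
  }
}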
Done.
lgc/builder/BuilderImpl.cpp
Outdated
  return;
}
assert(instrDeps.contains(currentVisitedInsn) && "The current visited instruction should have been in the map.");
auto it1 = instrDeps.try_emplace(newValue).first;
Can you use a structured binding so this is a bit cleaner?
I am not sure it is needed, because try_emplace() returns a pair of an iterator and a bool. The latter is true if newValue was inserted into the map and false if it already existed.
Yes, but you can check for the boolean value directly. However, it is just personal taste.
I am not using the boolean. That's why I did not add it.
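For reference, the structured-binding form the reviewer had in mind would replace the try_emplace call quoted above (whether the returned bool ends up being used is a style call):

auto [it1, inserted] = instrDeps.try_emplace(newValue);
auto &setOfInsns = it1->second; // freshly default-constructed set when inserted == true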
auto &setOfInsns = it1->second;
auto it2 = instrDeps.find(currentVisitedInsn);
const auto &set = it2->second;
for (auto it3 = set.begin(indexToInstr), ite = set.end(indexToInstr); it3 != ite; ++it3) {
Can you use a range based for loop instead?
I did not do it because begin() and end() of TinyInstructionSet take an argument. Please let me know if there is a way to do it.
I see, you originally stored indexToInstr as a member reference, right? With that approach, it would have worked.
Exactly :)
How about not storing the reference to indexToInstr and instead adding a separate member function that returns an iterator_range on which you can iterate?
iterator_range<IteratorT> getIteratorRange(IndexToInstrMap &indexToInstr) {
  return iterator_range(set.begin(indexToInstr), set.end(indexToInstr));
}
That should allow you to use a range-based for loop without increasing the storage size of TinyInstructionSet?
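With such a helper on TinyInstructionSet, the explicit-iterator loop quoted above could then become a range-based loop (sketch; getIteratorRange is the proposed helper, and this assumes its iterator dereferences to an Instruction *):

for (llvm::Instruction *instr : set.getIteratorRange(indexToInstr)) {
  // ... same body as the explicit-iterator loop ...
}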
Force-pushed from 43a7a13 to bbe0a57.
Test summary for commit bbe0a57: CTS tests (Failed: 0/138443)
Ubuntu navi3x, Srdcvk; Ubuntu navi2x, Srdcvk
Force-pushed from 504d64b to 8179bc4.
Test summary for commit 8179bc4: CTS tests (Failed: 0/138378)
Ubuntu navi3x, Srdcvk; Ubuntu navi2x, Srdcvk
Test summary for commit 504d64b: CTS tests (Failed: 0/138443)
Ubuntu navi3x, Srdcvk; Ubuntu navi2x, Srdcvk
void TraceNonUniformIndex::insertNewValueInInstrDeps(Value *newValue, Instruction *currentVisitedInstr) {
  if (!instrToIndex.contains(currentVisitedInstr)) {
    // The instruction is either outside of 64 limit or in a different basic block. So, we bail-out scalarization.
    scalarizeDescriptorLoads = false;
It seems setting TraceNonUniformIndex.scalarizeDescriptorLoads to false here means that the next time TraceNonUniformIndex::run is invoked it will not try to scalarize. That means that if there are two non-uniform operands of, say, image.sample, and the first one could not be scalarized, then the second one would not be either.
This is how the existing code (before my changes) works. I am working to change it.
Fair enough. Can you extend the comment to make it clear that this will also affect scalarization of other operands?
Implementation-wise I guess it is fine to address it in a follow-up commit to keep the scope of the changes smaller.
Done in #2859
void TraceNonUniformIndex::insertNewValueInInstrDeps(Value *newValue, Instruction *currentVisitedInstr) {
  if (!instrToIndex.contains(currentVisitedInstr)) {
    // The instruction is either outside of 64 limit or in a different basic block. So, we bail-out scalarization.
"or in a different basic block"
I know this is part of the design of this patch, but do we really have to give up scalarization if an instruction is in a different block?
A descriptor load in a different basic block than the image op is quite common, and with the changes here we would stop scalarizing it.
It is not part of the design of this patch. This is how the existing code (before my changes) works, because there is no backend support for it.
Yes, sorry, I misspoke - this is not really related to your patch.
Done in #2877
I think the code changes are good. I think Piotr's points are fair but let's address them in separate PRs that you can work on while we're working out the last remaining kinks here.
There's still a thing I noticed inline, and the LGC tests ought to be simplified. As-is, they're touching a bunch of unrelated things like reading VS inputs. Please take a look at other tests, especially in lgc/test/Transforms/{Combine,Lower}CooperativeMatrix for examples of tests that are focused on shorter instruction sequences.
Force-pushed from 8179bc4 to 627712f.
@nhaehnle : I am working on refactoring the tests.
Test summary for commit 627712f: CTS tests (Failed: 0/138443)
Ubuntu navi3x, Srdcvk; Ubuntu navi2x, Srdcvk
Force-pushed from 627712f to 195b936.
Done :)
Test summary for commit 195b936: CTS tests (Failed: 0/138443)
Ubuntu navi3x, Srdcvk; Ubuntu navi2x, Srdcvk
I refactored the tests as much as I could.
ping
ping
This patch refactors the code that pushes the loads inside the waterfall loop: