Avoid calling useResource on resources in argument buffers #2402
@synchronized (_physicalDevice->getMTLDevice()) {
    for (auto fence: _activeBarriers[stage]) {
Vulkan barriers run in submission order, so the fact that this is on MVKDevice (and requires synchronization) worries me.
Have you tested what happens if, e.g., you encode command buffers in immediate mode and then submit them in the opposite order from the one you encoded them in? Yes, it won't crash thanks to the @synchronized, but the fact that this is in a place that requires synchronization at all means that two threads could fight over the _activeBarriers list and probably do unexpected (but non-crashy) things.
Also, any reason you're retaining and releasing all the fences? Don't they live as long as the MVKDevice (which according to Vulkan should outlive any active work on it)?
Right, that's a good point about keeping the fences there, in addition to the multiple queue problem.
Maybe I could avoid the requirement to encode only after submit (which would let us keep fences on MVKQueue) by keeping most fences local to the command buffer, and doing some boundary trick to synchronize between submissions on the queue. Not sure what that trick is yet.
The fences are currently only supposed to live as long as the last command buffer that uses them. When one gets removed from all wait/update slots, the only references left are those attached to the command buffer. It sure is more retaining and releasing than I originally expected, so I might just pull the trigger and keep a fixed number of reusable fences.
One possibility is to make sure the last group in a submission always updates a known fence, and then always start with waiting on that fence on new submissions:
 1   2       3   4       5   6
avb avb  B  bvc bvc  B  cva cva
 f   f   B  bf  bf   B  cf  cf
(And if you go the reusable fence route, just have everyone use the same array of fences. Always start at index 0, and update index 0 at the end of a submission. Note that fences in Metal, like barriers in Vulkan, also work in submission order, so the worst that could happen using the same fences across multiple encoders at once is more synchronization than you wanted, but assuming you don't mix fences for different pipeline stages, I don't think that will be a big issue.)
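One way to read the suggestion above is as an index assignment over a shared array of fences. The sketch below is a plain C++ model, not MoltenVK or Metal code: `FenceSlots` and `assignFences` are hypothetical names, and cycling the intermediate groups through indices 1..N-1 is an assumption made purely for illustration. It only shows that the first group of a submission waits on index 0 and the last group updates index 0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: the fence index a barrier-separated group waits on
// before it runs, and the index it updates after it finishes.
struct FenceSlots {
    std::size_t wait;
    std::size_t update;
};

// Assign fence indices to the groups of one submission.
// Assumptions (not from MoltenVK): fenceCount >= 2, and intermediate groups
// cycle through indices 1..fenceCount-1 so that index 0 stays reserved for
// submission boundaries.
std::vector<FenceSlots> assignFences(std::size_t groupCount, std::size_t fenceCount) {
    assert(fenceCount >= 2);
    std::vector<FenceSlots> slots;
    std::size_t wait = 0;  // every submission starts by waiting on index 0
    for (std::size_t i = 0; i < groupCount; ++i) {
        bool last = (i + 1 == groupCount);
        // The last group updates index 0 so the next submission can wait on it.
        std::size_t update = last ? 0 : 1 + (i % (fenceCount - 1));
        slots.push_back({wait, update});
        wait = update;  // the next group waits on what this group updated
    }
    return slots;
}
```

With three groups and three fences this yields (0,1), (1,2), (2,0), which corresponds to the a/b/c pattern on the vertex row of the diagram. A submission with a single group both waits on and updates index 0 in this model.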
Since there are a few design and implementation points under discussion, I've moved this to WIP.
// Initialize fences for execution barriers
for (auto &stage: _barrierFences) for (auto &fence: stage) fence = [_physicalDevice->getMTLDevice() newFence];
Could you give the fences labels like [fence setLabel:[NSString stringWithFormat:@"%s Fence %d", stageName(stage), idx]]? Would be very convenient for debugging.
Sure, pushed it.
Note that I removed the host stage, I don't think it needs to be explicit, but there probably? should be some waits in |
My understanding is that Metal guarantees memory coherency once you're able to observe that an operation has completed (e.g. through a shared event or by checking the completed status of a command buffer), so I think this is correct, since you'd need to do the same even with the host memory barrier in Vulkan.
Alright, my concern with |
@js6i I see you've removed the WIP tag. Is this PR ready for overall review and merging?
Yes, I meant to submit it for review.
This PR implements execution barriers with Metal fences and puts all resources in a residency set to avoid having to useResource all resources in bound argument buffers. That makes it possible to run programs that use descriptor indexing with large descriptor tables efficiently.
Consider a pipeline executing some render passes with a couple vertex to fragment barriers:
Here v and f symbolize the vertex and fragment stages of a render pass, and B stands for the barrier.
In this example, stages v1 and v2 need to run before f3..6, and v1..4 before f5 and f6.
To implement this I maintain a set of fences that will be waited on before each stage, and updated after it. Here's a diagram with the fences a and b placed before the stage symbol when waited on, and after when updated:
Here v1 updates fence a, v4 waits for a and updates b, f4 waits for a, etc.
Note that the synchronization is a little stronger than the original - v3..6 are forced to execute after v1 and v2. This is for practical reasons - I want to keep a constant, limited set of fences active, only wait for one fence per stage pair, and only update one fence per stage.
There's some things that could be improved here: