Interleaved multi-queue #3096

GertyP · 2023-10-19T14:33:52Z

Description

I remember privately sending renderdoc feedback a few years ago after discovering that it couldn't handle submission to multiple queues (graphics and multiple async compute) where the commands, waits, and signals for each queue are submitted in the entirety of the frame's work, one queue at a time and where there were interdependencies between queues. I.e. work doesn't get carefully submitted to each queue in an order that avoids the GPU ever getting stuck on a wait.

I'm please to see that this appears to no longer be an issue with renderdoc (at least with this simple new test d3d12_interleaved_multi_queue) but since I can't see any other multi-queue d3d12 tests (I could only see only single graphics or single compute queue tests), I thought it might be worth adding this to the set of d3d12 demos/tests.

I don't know how you feel about requiring clean separation of different test elements but, because it was relatively easy to add (and something that also doesn't appear to be covered in the existing d3d12 demos/tests), this new test includes optional support for using dynamic descriptor heap reasources (suitably checking for hardware support at run-time). If this is a big deal, I can separate just that out into a different test, if that's something you'd like.

Added 'D3D12BufferCreator::InitialState(D3D12_RESOURCE_STATES)' to allow user code to avoid adding unnecessary transition barriers after resource creation. Added 'D3D12TextureCreator::ClearVal(const D3D12_CLEAR_VALUE &)' to avoid a PIX performance warning/tip about clearing resources that haven't had their clear value set on creation.

baldurk · 2023-10-19T14:45:56Z

Thanks for this, yes I added in syncs between queue submission changes which should handle at least some case of async queue usage.

Regarding your second point, combined tests makes sense if there is some interaction of features but there's no relation between multiple queue testing and SM6.6 testing so I wouldn't want that included in this test. There's also already a test for SM6.6 in D3D12_Descriptor_Indexing which is specifically designed for testing that kind of functionality.

GertyP · 2023-10-19T15:12:37Z

There's also already a test for SM6.6 in D3D12_Descriptor_Indexing

Thanks. I see it now. I think I'll have searched for D3D12_RESOURCE_BINDING_TIER_3, which isn't a checked feature in that test but I think is required for dynamic descriptor resources. I believed two separate features/aspects needed checking but I wonder whether testing for D3D_SHADER_MODEL_6_6 always implies support for D3D12_RESOURCE_BINDING_TIER_3 or whether there's some weird platform that supports the first without the second. 🤔
Anyway, I'll remove descriptor heap indexing.

GertyP · 2023-10-23T10:20:02Z

The requested change was made, by the way. I stripped out the 'dynamic resource' / 'descriptor heap indexing' aspect and squashed it into the one commit.

baldurk

I think this looks mostly fine, I've left a couple of minor nitpick style comments.

As an overall thing though I wanted to clarify - it's ambiguous to me what exactly you're trying to test here. The way the test is constructed this looks like it will replay in RenderDoc, however it won't be replaying precisely as described because RenderDoc does not correctly handle wait-before-signal. If you for example had the compute queues read some data filled out by the graphics queue in the 'clear RTs' step, they would not see it because each submit is processed in CPU-submission order, without respect to waits and signals.

If you're only testing to make sure that overlap without wait-before-signal is handled then that's fine, and this test will do it, but with the way you've constructed it I'm not clear exactly what case you're trying to test.

In addition to that, this demo is definitely useful but by itself it is missing the testing logic that would actually cause it to run in the nightly CI. That would involve writing a python script to run the test, grab a capture, and test whatever you're looking to test. In this case I think that would involve looking for the right colours to be produced.

baldurk · 2023-10-25T14:16:19Z

util/test/demos/d3d12/d3d12_interleaved_multi_queue.cpp

+        {// DescriptorTable / D3D12_ROOT_DESCRIPTOR_TABLE1
+         sizeof(tableDescriptorRanges) / sizeof(tableDescriptorRanges[0]),    // NumDescriptorRanges
+         &tableDescriptorRanges[0]},
+        D3D12_SHADER_VISIBILITY_ALL};


Minor formatting nit - putting trailing commas in these arrays will make clang format lay them out a bit clearer. Not required but it would probably improve things.

Also just to check - is it not possible to formulate these bindings with the existing tableParam() helper? It would require two root parameters instead of one with two ranges, but does that affect the test at all? If not it would be nice to remove the boilerplate here by using that helper even if it's not optimal.

I've made the change from a single root descriptor table of 2-ranges to 2 root tables (using tableParam).

I can understand if you prefer to use multiple instances of tableParam, especially when root parameters aren't at a premium, which is the case here. I did it that way initially just because that seemed the most usual way of doing it and with the slightly unusual way tableParam builds up hidden internal table ranges, I couldn't actually get the exact descriptor table I had in mind ... but I'm fine with changing the structure a little to cut down on a few lines.

The autotest project primarily focusses on testing what we aim to test with the simplest/most basic code possible so that when reading the code you focus on what the test is stressing. It's definitely not optimal but unless the test is specifically about descriptor ranges it's the most straightforward way to express it yeh.

baldurk · 2023-10-25T14:17:24Z

util/test/demos/d3d12/d3d12_interleaved_multi_queue.cpp

+                                   .DSV()
+                                   .NoSRV()
+                                   .InitialState(D3D12_RESOURCE_STATE_DEPTH_WRITE)
+                                   .ClearVal(clearVal);


FWIW this is totally fine to go to the lengths of setting the clear value, but I typically disable/ignore that warning as extremely noisy and extremely low value.

baldurk · 2023-10-25T14:18:57Z

util/test/demos/d3d12/d3d12_interleaved_multi_queue.cpp

+      ID3D12GraphicsCommandListPtr cmd = GetCommandBuffer();
+      Reset(cmd);
+      ResourceBarrier(cmd, vb, D3D12_RESOURCE_STATE_COMMON,
+                      D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER);


Minor nit - there is an overload of this ResourceBarrier helper that doesn't take a command buffer and does the reset/submit dance for you.

Done. That's nicer. Thanks.

baldurk · 2023-10-25T14:20:52Z

util/test/demos/demos.vcxproj

@@ -477,4 +478,7 @@
      <Visible>false</Visible>
    </Content>
  </ItemGroup>
+  <ItemGroup>
+    <None Include="..\..\..\.clang-format" />
+  </ItemGroup>


What's the reason for including the .clang-format file here?

Good question: I missed that. Don't know. I think it may have been me trying to get VS(2022) to adopt that formatting style (or me investigating whether VS would auto-format differently).

Have now removed it.

baldurk · 2023-10-25T14:28:51Z

util/test/demos/d3d12/d3d12_interleaved_multi_queue.cpp

+
+        // Transition resources back to their expected states at the start of the next frame
+        ResourceBarrier(cmd, buffer, D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE,
+                        D3D12_RESOURCE_STATE_UNORDERED_ACCESS);


I think you want to clear the buffer here as well, since otherwise after the first loop the test doesn't actually demonstrate that the compute data was written properly before the graphics work ran - it's overwriting the same data over and over so it could produce the same output even if the replay is wrong.

after the first loop the test doesn't actually demonstrate that the compute data was written properly before the graphics work ran

The test now first resets the buffer to a pure blue from the gfx queue, then the compute queues read and add to the colours, adding brown and green so that the gfx pixel shading colour the triangles purple and sky blue. That's ultimately what we'd like the replay to show but, for now at least, the replay draws both tris pure blue, as expected because of the current behaviour you've described.

GertyP · 2023-10-27T10:14:55Z

Thanks for the feedback.
I'll hopefully get time to respond and make changes soon.

GertyP · 2023-10-27T18:47:51Z

it won't be replaying precisely as described because RenderDoc does not correctly handle wait-before-signal. If you for example had the compute queues read some data filled out by the graphics queue in the 'clear RTs' step, they would not see it because each submit is processed in CPU-submission order, without respect to waits and signals.

I see. This was the kind of thing that I wasn't sure about if/how Renderdoc would handle this submission style (entire frame's sequence for each GPU queue at a time) and so thought this test would either fail in some way and reveal to me the bits of Rd code involved, what we have available in order to more appropriately serialise and spoon-feed the GPU with captured queue sequences for replaying, and what options look best to handle this and correctly replay as it would on Pix (or how the GPU would consume the work).

Since this style of submission is normal/legitimate (and also quite clean and convenient from the gfx application's point of view), the real goal I have in mind is to aim towards Renderdoc having correct multi-queue dependency replay serialisation, which you point out is not yet sufficiently tested by these changes, so I've added an initial gfx buffer write which we then read-modify-write in subsequent compute queue work.

In this case, the GPU correctly renders a sky-blue and purple triangle but because of what you've explained Rdoc does, the capture replay outputs two pure blue triangles.

So you can see the individual changes for each issue you mention, I've kept the changes separate for now and will probably squash them all into the first commit (if you agree), since they all fall under the 'add interleaved multi-queue test'.

GertyP · 2023-10-27T19:06:32Z

it is missing the testing logic that would actually cause it to run in the nightly CI

Am I right in thinking that the current CI (I'm just looking at .github/workflows/ci.yml) doesn't yet do any whole system image testing; only small, focused unit testing logic at moment?

I'm not yet familiar with the pythonised pixel testing side of things but is that something I could add as a separate PR and is this kind of thing fairly straightforward to do through the 'util\test\tests\D3D12\D3D12_...py' test files?

Initially, because I wouldn't want to add a failing test and because I would expect Renderdoc's replay, in its current state, not to result in the sky blue and purple triangles, I'd aim to have a test that currently expects the pure blue 'reset buffer' colours for now.

baldurk · 2023-10-30T15:51:17Z

The github CI is only unit tests because it doesn't run on a machine with a capable GPU so it can't run any kind of functional tests. I run the functional tests nightly and upload the results as part of the nightly builds, on a machine with a GPU.

Indeed the separate python scripts are basically little harnesses to look at a capture and do something things like e.g. check that the triangles are rendered and in the correct colours. It would be simplest not to commit one of those yet, but you could also submit it disabled so it doesn't run yet (each test has a function to check for support, in case of e.g. hardware requirements, which could be set to return false for now).

It sounds like what you're specifically wanting to support and test then is wait-before-signal? Indeed that is not supported by RenderDoc because it is awkward to handle and confusing to display to the user since you no longer have a linear view of the commands in the frame.

It would be best to rename the test & commit to explicitly state that you are testing wait-before-signal, since that is a specific case and not the same as generalised async compute/interleaved queue operations. Other than that yes if you squash this all into one commit it can be merged then now that I understand what you're wanting to test.

This tests submitting commands, signals, and waits to graphics and multiple compute queues where the entire frame's submission for each queue depends on the other queues and the submission order is not the order in which the GPU (and replay) need to be processed. This is to ensure any capture or replaying deserialises adequately.

GertyP · 2023-11-02T18:07:39Z

Thanks.
Renamed and squashed.

(I thought it useful to still keep the 'multi' in the name, just in case people who aren't familiar with the tests and who might just be browsing them can get a hint that there's multiple queue submission work going on too; 'wait-before-signal' doesn't immediately convey 'interleaved, out-of-render-order, multiple queues' to me. Anyone else wanting to do similar multi-queue testing may more quickly locate and understand what's going on. For example, I was hunting around for the kind of test boilerplate required to do this new multi-queue test amongst the existing tests before realising they don't have what I needed... And if that ever happens, I think that would be the time to factor out any of the shared utility from stuff like the little class ComputeQueues helper.)

GertyP changed the title ~~Tests~~ Interleaved multi-queue Oct 19, 2023

GertyP force-pushed the tests branch from 799232b to bce2f1f Compare October 19, 2023 15:35

baldurk requested changes Oct 25, 2023

View reviewed changes

GertyP requested a review from baldurk October 27, 2023 19:06

GertyP force-pushed the tests branch from 6cc1888 to 83ecb5b Compare November 2, 2023 17:35

baldurk merged commit 7619414 into baldurk:v1.x Nov 3, 2023
16 checks passed

GertyP deleted the tests branch November 4, 2023 11:46

GertyP mentioned this pull request Nov 6, 2023

New 'multi wait before signal' pixel checking python test #3114

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Interleaved multi-queue #3096

Interleaved multi-queue #3096

GertyP commented Oct 19, 2023

baldurk commented Oct 19, 2023

GertyP commented Oct 19, 2023

GertyP commented Oct 23, 2023

baldurk left a comment

baldurk Oct 25, 2023

GertyP Oct 27, 2023

baldurk Oct 30, 2023

baldurk Oct 25, 2023

baldurk Oct 25, 2023

GertyP Oct 27, 2023

baldurk Oct 25, 2023

GertyP Oct 27, 2023

baldurk Oct 25, 2023

GertyP Oct 27, 2023

GertyP commented Oct 27, 2023

GertyP commented Oct 27, 2023

GertyP commented Oct 27, 2023

baldurk commented Oct 30, 2023

GertyP commented Nov 2, 2023

Interleaved multi-queue #3096

Interleaved multi-queue #3096

Conversation

GertyP commented Oct 19, 2023

Description

baldurk commented Oct 19, 2023

GertyP commented Oct 19, 2023

GertyP commented Oct 23, 2023

baldurk left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

GertyP commented Oct 27, 2023

GertyP commented Oct 27, 2023

GertyP commented Oct 27, 2023

baldurk commented Oct 30, 2023

GertyP commented Nov 2, 2023