Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interleaved multi-queue #3096

Merged
merged 2 commits into from
Nov 3, 2023
Merged

Interleaved multi-queue #3096

merged 2 commits into from
Nov 3, 2023

Conversation

GertyP
Copy link
Contributor

@GertyP GertyP commented Oct 19, 2023

Description

I remember privately sending renderdoc feedback a few years ago after discovering that it couldn't handle submission to multiple queues (graphics and multiple async compute) where the commands, waits, and signals for each queue are submitted in the entirety of the frame's work, one queue at a time and where there were interdependencies between queues. I.e. work doesn't get carefully submitted to each queue in an order that avoids the GPU ever getting stuck on a wait.

I'm please to see that this appears to no longer be an issue with renderdoc (at least with this simple new test d3d12_interleaved_multi_queue) but since I can't see any other multi-queue d3d12 tests (I could only see only single graphics or single compute queue tests), I thought it might be worth adding this to the set of d3d12 demos/tests.

I don't know how you feel about requiring clean separation of different test elements but, because it was relatively easy to add (and something that also doesn't appear to be covered in the existing d3d12 demos/tests), this new test includes optional support for using dynamic descriptor heap reasources (suitably checking for hardware support at run-time). If this is a big deal, I can separate just that out into a different test, if that's something you'd like.

Added 'D3D12BufferCreator::InitialState(D3D12_RESOURCE_STATES)' to allow
user code to avoid adding unnecessary transition barriers after resource
creation.

Added 'D3D12TextureCreator::ClearVal(const D3D12_CLEAR_VALUE &)' to avoid
a PIX performance warning/tip about clearing resources that haven't had
their clear value set on creation.
@GertyP GertyP changed the title Tests Interleaved multi-queue Oct 19, 2023
@baldurk
Copy link
Owner

baldurk commented Oct 19, 2023

Thanks for this, yes I added in syncs between queue submission changes which should handle at least some case of async queue usage.

Regarding your second point, combined tests makes sense if there is some interaction of features but there's no relation between multiple queue testing and SM6.6 testing so I wouldn't want that included in this test. There's also already a test for SM6.6 in D3D12_Descriptor_Indexing which is specifically designed for testing that kind of functionality.

@GertyP
Copy link
Contributor Author

GertyP commented Oct 19, 2023

There's also already a test for SM6.6 in D3D12_Descriptor_Indexing

Thanks. I see it now. I think I'll have searched for D3D12_RESOURCE_BINDING_TIER_3, which isn't a checked feature in that test but I think is required for dynamic descriptor resources. I believed two separate features/aspects needed checking but I wonder whether testing for D3D_SHADER_MODEL_6_6 always implies support for D3D12_RESOURCE_BINDING_TIER_3 or whether there's some weird platform that supports the first without the second. 🤔
Anyway, I'll remove descriptor heap indexing.

@GertyP
Copy link
Contributor Author

GertyP commented Oct 23, 2023

The requested change was made, by the way. I stripped out the 'dynamic resource' / 'descriptor heap indexing' aspect and squashed it into the one commit.

Copy link
Owner

@baldurk baldurk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks mostly fine, I've left a couple of minor nitpick style comments.

As an overall thing though I wanted to clarify - it's ambiguous to me what exactly you're trying to test here. The way the test is constructed this looks like it will replay in RenderDoc, however it won't be replaying precisely as described because RenderDoc does not correctly handle wait-before-signal. If you for example had the compute queues read some data filled out by the graphics queue in the 'clear RTs' step, they would not see it because each submit is processed in CPU-submission order, without respect to waits and signals.

If you're only testing to make sure that overlap without wait-before-signal is handled then that's fine, and this test will do it, but with the way you've constructed it I'm not clear exactly what case you're trying to test.

In addition to that, this demo is definitely useful but by itself it is missing the testing logic that would actually cause it to run in the nightly CI. That would involve writing a python script to run the test, grab a capture, and test whatever you're looking to test. In this case I think that would involve looking for the right colours to be produced.

{// DescriptorTable / D3D12_ROOT_DESCRIPTOR_TABLE1
sizeof(tableDescriptorRanges) / sizeof(tableDescriptorRanges[0]), // NumDescriptorRanges
&tableDescriptorRanges[0]},
D3D12_SHADER_VISIBILITY_ALL};
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor formatting nit - putting trailing commas in these arrays will make clang format lay them out a bit clearer. Not required but it would probably improve things.

Also just to check - is it not possible to formulate these bindings with the existing tableParam() helper? It would require two root parameters instead of one with two ranges, but does that affect the test at all? If not it would be nice to remove the boilerplate here by using that helper even if it's not optimal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made the change from a single root descriptor table of 2-ranges to 2 root tables (using tableParam).

I can understand if you prefer to use multiple instances of tableParam, especially when root parameters aren't at a premium, which is the case here. I did it that way initially just because that seemed the most usual way of doing it and with the slightly unusual way tableParam builds up hidden internal table ranges, I couldn't actually get the exact descriptor table I had in mind ... but I'm fine with changing the structure a little to cut down on a few lines.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The autotest project primarily focusses on testing what we aim to test with the simplest/most basic code possible so that when reading the code you focus on what the test is stressing. It's definitely not optimal but unless the test is specifically about descriptor ranges it's the most straightforward way to express it yeh.

.DSV()
.NoSRV()
.InitialState(D3D12_RESOURCE_STATE_DEPTH_WRITE)
.ClearVal(clearVal);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this is totally fine to go to the lengths of setting the clear value, but I typically disable/ignore that warning as extremely noisy and extremely low value.

ID3D12GraphicsCommandListPtr cmd = GetCommandBuffer();
Reset(cmd);
ResourceBarrier(cmd, vb, D3D12_RESOURCE_STATE_COMMON,
D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit - there is an overload of this ResourceBarrier helper that doesn't take a command buffer and does the reset/submit dance for you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. That's nicer. Thanks.

@@ -477,4 +478,7 @@
<Visible>false</Visible>
</Content>
</ItemGroup>
<ItemGroup>
<None Include="..\..\..\.clang-format" />
</ItemGroup>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for including the .clang-format file here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question: I missed that. Don't know. I think it may have been me trying to get VS(2022) to adopt that formatting style (or me investigating whether VS would auto-format differently).

Have now removed it.


// Transition resources back to their expected states at the start of the next frame
ResourceBarrier(cmd, buffer, D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE,
D3D12_RESOURCE_STATE_UNORDERED_ACCESS);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want to clear the buffer here as well, since otherwise after the first loop the test doesn't actually demonstrate that the compute data was written properly before the graphics work ran - it's overwriting the same data over and over so it could produce the same output even if the replay is wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after the first loop the test doesn't actually demonstrate that the compute data was written properly before the graphics work ran

The test now first resets the buffer to a pure blue from the gfx queue, then the compute queues read and add to the colours, adding brown and green so that the gfx pixel shading colour the triangles purple and sky blue. That's ultimately what we'd like the replay to show but, for now at least, the replay draws both tris pure blue, as expected because of the current behaviour you've described.

@GertyP
Copy link
Contributor Author

GertyP commented Oct 27, 2023

Thanks for the feedback.
I'll hopefully get time to respond and make changes soon.

@GertyP
Copy link
Contributor Author

GertyP commented Oct 27, 2023

it won't be replaying precisely as described because RenderDoc does not correctly handle wait-before-signal. If you for example had the compute queues read some data filled out by the graphics queue in the 'clear RTs' step, they would not see it because each submit is processed in CPU-submission order, without respect to waits and signals.

I see. This was the kind of thing that I wasn't sure about if/how Renderdoc would handle this submission style (entire frame's sequence for each GPU queue at a time) and so thought this test would either fail in some way and reveal to me the bits of Rd code involved, what we have available in order to more appropriately serialise and spoon-feed the GPU with captured queue sequences for replaying, and what options look best to handle this and correctly replay as it would on Pix (or how the GPU would consume the work).

Since this style of submission is normal/legitimate (and also quite clean and convenient from the gfx application's point of view), the real goal I have in mind is to aim towards Renderdoc having correct multi-queue dependency replay serialisation, which you point out is not yet sufficiently tested by these changes, so I've added an initial gfx buffer write which we then read-modify-write in subsequent compute queue work.

In this case, the GPU correctly renders a sky-blue and purple triangle but because of what you've explained Rdoc does, the capture replay outputs two pure blue triangles.

So you can see the individual changes for each issue you mention, I've kept the changes separate for now and will probably squash them all into the first commit (if you agree), since they all fall under the 'add interleaved multi-queue test'.

@GertyP
Copy link
Contributor Author

GertyP commented Oct 27, 2023

it is missing the testing logic that would actually cause it to run in the nightly CI

Am I right in thinking that the current CI (I'm just looking at .github/workflows/ci.yml) doesn't yet do any whole system image testing; only small, focused unit testing logic at moment?

I'm not yet familiar with the pythonised pixel testing side of things but is that something I could add as a separate PR and is this kind of thing fairly straightforward to do through the 'util\test\tests\D3D12\D3D12_...py' test files?

Initially, because I wouldn't want to add a failing test and because I would expect Renderdoc's replay, in its current state, not to result in the sky blue and purple triangles, I'd aim to have a test that currently expects the pure blue 'reset buffer' colours for now.

@GertyP GertyP requested a review from baldurk October 27, 2023 19:06
@baldurk
Copy link
Owner

baldurk commented Oct 30, 2023

The github CI is only unit tests because it doesn't run on a machine with a capable GPU so it can't run any kind of functional tests. I run the functional tests nightly and upload the results as part of the nightly builds, on a machine with a GPU.

Indeed the separate python scripts are basically little harnesses to look at a capture and do something things like e.g. check that the triangles are rendered and in the correct colours. It would be simplest not to commit one of those yet, but you could also submit it disabled so it doesn't run yet (each test has a function to check for support, in case of e.g. hardware requirements, which could be set to return false for now).

It sounds like what you're specifically wanting to support and test then is wait-before-signal? Indeed that is not supported by RenderDoc because it is awkward to handle and confusing to display to the user since you no longer have a linear view of the commands in the frame.

It would be best to rename the test & commit to explicitly state that you are testing wait-before-signal, since that is a specific case and not the same as generalised async compute/interleaved queue operations. Other than that yes if you squash this all into one commit it can be merged then now that I understand what you're wanting to test.

This tests submitting commands, signals, and waits to graphics and
multiple compute queues where the entire frame's submission for each
queue depends on the other queues and the submission order is not the
order in which the GPU (and replay) need to be processed.  This is to
ensure any capture or replaying deserialises adequately.
@GertyP
Copy link
Contributor Author

GertyP commented Nov 2, 2023

Thanks.
Renamed and squashed.

(I thought it useful to still keep the 'multi' in the name, just in case people who aren't familiar with the tests and who might just be browsing them can get a hint that there's multiple queue submission work going on too; 'wait-before-signal' doesn't immediately convey 'interleaved, out-of-render-order, multiple queues' to me. Anyone else wanting to do similar multi-queue testing may more quickly locate and understand what's going on. For example, I was hunting around for the kind of test boilerplate required to do this new multi-queue test amongst the existing tests before realising they don't have what I needed... And if that ever happens, I think that would be the time to factor out any of the shared utility from stuff like the little class ComputeQueues helper.)

@baldurk baldurk merged commit 7619414 into baldurk:v1.x Nov 3, 2023
16 checks passed
@GertyP GertyP deleted the tests branch November 4, 2023 11:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants