[SPIR-V] Inefficient codegen when writing to RWByteAddressBuffer #7089
Comments
The SPIR-V spec eludes me, and I'm not sure RWByteAddressBuffer Basically it couldn't handle anything past 32 bytes (and that would emit the Another problem is that when casting between two pointers, the Putting aside the issue of the allowed vector length, you still run into the problem of having to bitcast your storage value However that would mean having multiple aliased declarations of the same SSBO which can no longer be restrict, also there's the issue of alignment/offset into the variable length array. P.S. It's a bit weird that every single index of a
@devshgraphicsprogramming That's a shame. Maybe there's some other way besides having to emit something that uses BDA, though BDA could be a possibility if it's enabled explicitly via the SPIR-V path. That assumes there is some way to get the BDA of a descriptor, which I'm not sure about.
From the land of validation layers, yes *
I haven't read the Descriptor Buffer spec, but it may be that the N-byte opaque handles for an SSBO contain that BDA somewhere.
I cannot find the original HLSL, so I'm assuming this is something like https://godbolt.org/. From the DXC code generation perspective, there is not much we can do. If we want to reduce the number of stores, we could try to represent the byte address buffers as an array of
Longer term (in Clang, but not DXC), I want to represent ByteAddressBuffers using untyped pointers. Then the store turns into a single store.
I'll leave this open in case someone wants to try to work on modifying the current representation. My team will not get to it. This could happen in two steps:
I would do it as an optimization pass because it will be easier to test. We want the FE to have as few code paths as possible.
Description
When reading from or writing to a RWByteAddressBuffer, the SPIR-V backend emits a long series of loads/stores, each one covering a single uint.
Even though the driver might catch this and fix it at PSO creation time, certain validation tools might still see performance degradation.
In KhronosGroup/Vulkan-ValidationLayers#9317 (comment), compilation takes a very long time because every access needs to be validated individually.
If there is a way to reduce this, for example by using bitcasts as suggested by devshgraphicsprogramming in a different issue (#7038 (comment)), it could help a lot to reduce overhead with validation layers and potentially other tools as well. Less well-written drivers might also benefit from this.
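To make the cost concrete, here is a rough CPU-side model of the situation described above. It is only a sketch, not what DXC or any validation layer actually does: `ModelBuffer`, `storeWords`, `storeStruct`, and the 128-byte `Big` struct are all hypothetical names made up for illustration, and the 4-word access width is an assumed stand-in for a wider store. The model treats the buffer as a flat array of 32-bit words and counts one bounds check per access, so it shows how the number of checks scales with how the struct store is split.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a RWByteAddressBuffer: a flat array of 32-bit words.
struct ModelBuffer {
    std::vector<std::uint32_t> words;
    std::size_t checkedAccesses = 0;  // accesses a tool would have to bounds check

    explicit ModelBuffer(std::size_t byteSize) : words(byteSize / 4) {}

    // Store wordCount consecutive words as ONE access, i.e. one bounds check.
    bool storeWords(std::size_t wordIndex, const std::uint32_t* src, std::size_t wordCount) {
        ++checkedAccesses;
        if (wordIndex > words.size() || wordCount > words.size() - wordIndex) {
            return false;  // out of bounds: the access is rejected
        }
        std::memcpy(words.data() + wordIndex, src, wordCount * sizeof(std::uint32_t));
        return true;
    }
};

// Store value at byteOffset, split into accesses of at most wordsPerAccess words.
// wordsPerAccess == 1 models the per-uint lowering from this issue; larger values
// model an assumed wider store.
template <typename T>
bool storeStruct(ModelBuffer& buf, std::size_t byteOffset, const T& value,
                 std::size_t wordsPerAccess) {
    static_assert(std::is_trivially_copyable<T>::value, "model copies raw bytes");
    static_assert(sizeof(T) % 4 == 0, "model only handles multiples of 4 bytes");
    if (wordsPerAccess == 0 || byteOffset % 4 != 0) {
        return false;
    }
    const std::size_t total = sizeof(T) / 4;
    std::uint32_t tmp[sizeof(T) / 4];
    std::memcpy(tmp, &value, sizeof(T));
    for (std::size_t i = 0; i < total; i += wordsPerAccess) {
        const std::size_t n = std::min(wordsPerAccess, total - i);
        if (!buf.storeWords(byteOffset / 4 + i, tmp + i, n)) {
            return false;
        }
    }
    return true;
}

// Hypothetical "relatively big struct": 32 floats, 128 bytes.
struct Big {
    float m[16];
    float v[16];
};
```

In this model, storing one `Big` with `wordsPerAccess = 1` costs 32 checked accesses, while `wordsPerAccess = 4` costs 8. The real savings depend on what vector widths and alignments the target actually allows, which the comments above flag as an open problem.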
Steps to Reproduce
Load or store a relatively big struct through a RWByteAddressBuffer or ByteAddressBuffer. This emits a lot of bloat (https://godbolt.org/z/qE95f8jvb), and each index needs to be bounds checked by the tool.
Actual Behavior
Slowdowns on validation tools.
Environment