Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPIR-V] Inefficient codegen when writing to RWByteAddressBuffer #7089

Open
Nielsbishere opened this issue Jan 27, 2025 · 4 comments
Open

[SPIR-V] Inefficient codegen when writing to RWByteAddressBuffer #7089

Nielsbishere opened this issue Jan 27, 2025 · 4 comments
Labels
bug Bug, regression, crash spirv Work related to SPIR-V
Milestone

Comments

@Nielsbishere
Copy link
Contributor

Description
When writing/reading to/from a RWByteAddressBuffer, it emits a lot of loads/writes using a uint each.
Even though the driver might catch this and fix it at PSO creation time certain validation tools might still see performance degradations.
In this issue: KhronosGroup/Vulkan-ValidationLayers#9317 (comment) compilation takes a loooong time because every access needs to be validated individually.
If there is a way to reduce this, for example by using bitcasts as suggested by devshgraphicsprogramming in a different issue (#7038 (comment)) it could help a lot to reduce overhead with validation layers and potentially other tools as well. Maybe also drivers that are written less well can also benefit from this.

Steps to Reproduce
RWByteAddressBuffer or ByteAddressBuffer load/write with a relatively big struct. This will emit a lot of bloat https://godbolt.org/z/qE95f8jvb. Where each index needs to be bounds checked by the tool.

Actual Behavior
Slowdowns on validation tools.

Environment

  • DXC version: Latest from a few days back
  • Host Operating System: N/A (though the issue happens on Android)
@Nielsbishere Nielsbishere added bug Bug, regression, crash needs-triage Awaiting triage spirv Work related to SPIR-V labels Jan 27, 2025
@devshgraphicsprogramming

The SPIR-V spec eludes me, and I'm not sure RWByteAddressBuffer OpBitCast could cast from type T to anything else than an integer scalar or vector.

Basically it couldn't handle anything past 32 bytes (and that would emit the ShaderUint64 capability), or odd bytesizes of 5,7,9, etc. and multiples of 2 past 8, or multiples of 3 past 15 due to no such SPIR-V vectors existing (or can SPIR-V vectors have more than 4 components?).

Another problem is that when casting between two pointers, the OpBitcast cannot cast between two different storage classes which murders the dream here.

Putting aside the issue of the allowed vector length, you still run into the problem of having to bitcast your storage value T to said vector<uintK_t,N> (such that N*K= sizeof(T)) temporary variable and then OpStore-ing that value to an SSBO.

However that would mean having multiple aliased declarations of the same SSBO which can no longer be restrict, also there's the issue of alignment/offset into the variable length array.

P.S. Its a bit weird that every single index of a uint accessing a DWORD gets bound checked, when its only the first and the last that should get bounds checked.

@Nielsbishere
Copy link
Contributor Author

@devshgraphicsprogramming That's a shame, maybe there's some other way besides having to emit something that uses BDA, though BDA could be a possibility if it's enabled via the spirv path explicitly. If there somehow is a way to get the BDA of a descriptor? which I'm not sure about.
I agree about the bounds checking and they're working on this, but it wouldn't surprise me if it snuck in there for other tools as well.

@devshgraphicsprogramming

If there somehow is a way to get the BDA of a descriptor? which I'm not sure about.

From the land of validation layers, yes *

  • as long as you're not using EXT_descriptor_buffer or some other thing that can update descriptor set bindings of SSBOs without an API call

I haven't read the Descriptor Buffer spec, but it may be that the N byte opaque handles for an SSBO contain that BDA somewhere.

@damyanp damyanp moved this to For Google in HLSL Triage Jan 29, 2025
@s-perron
Copy link
Collaborator

s-perron commented Feb 5, 2025

I cannot find the original HLSL, so I'm assuming this is something like https://godbolt.org/.

From the DXC code generation perspective, there is not much we can do. If we want to reduce the number of stores, we could try to represent the byte address buffers an array of int4 instead of int. That could reduce the number of stores by a factor of 4. However, it would make the codegen more complicated. We have had a lot of bugs in this part of the code doing layout calculations.

Longer term (in Clang, but not DXC), I want to represent ByteAddressBuffers using untyped pointers. Then the store turns into a single store.

I'll leave this open in case someone wants to try to work on modifying the current representation. My team will not get to it.

This could happen in two steps:

  • Change the representation to int4 and modify the access chains that currently have an index n to have two indicies n/4 and n%4. This will not change the number of loads and stores, but should be functionally correct.
  • Add an optimization pass to spirv-opt that will group loads/stores covering an entire vector into a single load or store.

I would do it as an optimization pass because it will be easier to test. We want the FE to have as few code paths as possible.

@s-perron s-perron moved this from For Google to Triaged in HLSL Triage Feb 5, 2025
@s-perron s-perron removed the needs-triage Awaiting triage label Feb 5, 2025
@s-perron s-perron added this to the Dormant milestone Feb 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug, regression, crash spirv Work related to SPIR-V
Projects
Status: New
Status: Triaged
Development

No branches or pull requests

3 participants