[SPIR-V] Inefficient codegen when writing to RWByteAddressBuffer #7089

Nielsbishere · 2025-01-27T17:26:21Z

Description
When writing/reading to/from a RWByteAddressBuffer, it emits a lot of loads/writes using a uint each.
Even though the driver might catch this and fix it at PSO creation time certain validation tools might still see performance degradations.
In this issue: KhronosGroup/Vulkan-ValidationLayers#9317 (comment) compilation takes a loooong time because every access needs to be validated individually.
If there is a way to reduce this, for example by using bitcasts as suggested by devshgraphicsprogramming in a different issue (#7038 (comment)) it could help a lot to reduce overhead with validation layers and potentially other tools as well. Maybe also drivers that are written less well can also benefit from this.

Steps to Reproduce
RWByteAddressBuffer or ByteAddressBuffer load/write with a relatively big struct. This will emit a lot of bloat https://godbolt.org/z/qE95f8jvb. Where each index needs to be bounds checked by the tool.

Actual Behavior
Slowdowns on validation tools.

Environment

DXC version: Latest from a few days back
Host Operating System: N/A (though the issue happens on Android)

devshgraphicsprogramming · 2025-01-28T20:28:08Z

The SPIR-V spec eludes me, and I'm not sure RWByteAddressBuffer OpBitCast could cast from type T to anything else than an integer scalar or vector.

Basically it couldn't handle anything past 32 bytes (and that would emit the ShaderUint64 capability), or odd bytesizes of 5,7,9, etc. and multiples of 2 past 8, or multiples of 3 past 15 due to no such SPIR-V vectors existing (or can SPIR-V vectors have more than 4 components?).

Another problem is that when casting between two pointers, the OpBitcast cannot cast between two different storage classes which murders the dream here.

Putting aside the issue of the allowed vector length, you still run into the problem of having to bitcast your storage value T to said vector<uintK_t,N> (such that N*K= sizeof(T)) temporary variable and then OpStore-ing that value to an SSBO.

However that would mean having multiple aliased declarations of the same SSBO which can no longer be restrict, also there's the issue of alignment/offset into the variable length array.

P.S. Its a bit weird that every single index of a uint accessing a DWORD gets bound checked, when its only the first and the last that should get bounds checked.

Nielsbishere · 2025-01-28T20:33:12Z

@devshgraphicsprogramming That's a shame, maybe there's some other way besides having to emit something that uses BDA, though BDA could be a possibility if it's enabled via the spirv path explicitly. If there somehow is a way to get the BDA of a descriptor? which I'm not sure about.
I agree about the bounds checking and they're working on this, but it wouldn't surprise me if it snuck in there for other tools as well.

devshgraphicsprogramming · 2025-01-28T20:41:11Z

If there somehow is a way to get the BDA of a descriptor? which I'm not sure about.

From the land of validation layers, yes *

as long as you're not using EXT_descriptor_buffer or some other thing that can update descriptor set bindings of SSBOs without an API call

I haven't read the Descriptor Buffer spec, but it may be that the N byte opaque handles for an SSBO contain that BDA somewhere.

s-perron · 2025-02-05T19:01:18Z

I cannot find the original HLSL, so I'm assuming this is something like https://godbolt.org/.

From the DXC code generation perspective, there is not much we can do. If we want to reduce the number of stores, we could try to represent the byte address buffers an array of int4 instead of int. That could reduce the number of stores by a factor of 4. However, it would make the codegen more complicated. We have had a lot of bugs in this part of the code doing layout calculations.

Longer term (in Clang, but not DXC), I want to represent ByteAddressBuffers using untyped pointers. Then the store turns into a single store.

I'll leave this open in case someone wants to try to work on modifying the current representation. My team will not get to it.

This could happen in two steps:

Change the representation to int4 and modify the access chains that currently have an index n to have two indicies n/4 and n%4. This will not change the number of loads and stores, but should be functionally correct.
Add an optimization pass to spirv-opt that will group loads/stores covering an entire vector into a single load or store.

I would do it as an optimization pass because it will be easier to test. We want the FE to have as few code paths as possible.

Nielsbishere added bug Bug, regression, crash needs-triage Awaiting triage spirv Work related to SPIR-V labels Jan 27, 2025

github-project-automation bot added this to HLSL Triage Jan 27, 2025

damyanp moved this to For Google in HLSL Triage Jan 29, 2025

s-perron moved this from For Google to Triaged in HLSL Triage Feb 5, 2025

s-perron removed the needs-triage Awaiting triage label Feb 5, 2025

s-perron added this to HLSL Roadmap Feb 5, 2025

github-project-automation bot moved this to New in HLSL Roadmap Feb 5, 2025

s-perron added this to the Dormant milestone Feb 5, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPIR-V] Inefficient codegen when writing to RWByteAddressBuffer #7089

[SPIR-V] Inefficient codegen when writing to RWByteAddressBuffer #7089

Nielsbishere commented Jan 27, 2025

devshgraphicsprogramming commented Jan 28, 2025

Nielsbishere commented Jan 28, 2025

devshgraphicsprogramming commented Jan 28, 2025

s-perron commented Feb 5, 2025

[SPIR-V] Inefficient codegen when writing to RWByteAddressBuffer #7089

[SPIR-V] Inefficient codegen when writing to RWByteAddressBuffer #7089

Comments

Nielsbishere commented Jan 27, 2025

devshgraphicsprogramming commented Jan 28, 2025

Nielsbishere commented Jan 28, 2025

devshgraphicsprogramming commented Jan 28, 2025

s-perron commented Feb 5, 2025