This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Inefficient x64 codegen for splat #191

Open
abrown opened this issue Feb 6, 2020 · 7 comments

Comments

@abrown
Contributor

abrown commented Feb 6, 2020

splat has 2- to 3-instruction lowerings in Cranelift and V8. I believe the "splat all ones" and "splat all zeroes" cases lower to a single instruction on both engines, but it is unfortunate that other splat values incur a multi-instruction overhead, especially since splat would seem to be a high-use instruction.

@tlively
Member

tlively commented Feb 14, 2020

Since splat is a high-use instruction, is there a different semantics that would cover most of its uses and also have better codegen? Or would simplifying the codegen for splats just lead to proportionally more complex user code to regain their current functionality?

@dtig
Member

dtig commented Feb 18, 2020

This is very specifically an Intel ISA quirk, because pshufd/pshufw/pshufb all have different semantics. In the specific case you linked for i16x8.splat, the pshufw instruction only operates on 64-bit operands, not 128-bit operands, so there are a few different ways to synthesize this, but AFAIK they will all need at least two instructions to synthesize the splat. Apart from the V8 implementation linked, I suspect you would be looking at some combination of pshuflw, pshufhw, and/or pshufd, and possibly a move to an XMM register, depending on the engine implementation.
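As a concrete sketch of that combination, an i16x8 splat from a GPR can be written with baseline SSE2 intrinsics in three instructions (the function name is illustrative, not taken from V8 or Cranelift):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Illustrative i16x8.splat lowering: movd + pshuflw + pshufd. */
static inline __m128i splat_i16x8(int16_t x) {
    __m128i v = _mm_cvtsi32_si128((uint16_t)x);  /* movd: GPR -> low XMM lane */
    v = _mm_shufflelo_epi16(v, 0);               /* pshuflw: low 4 words = x  */
    return _mm_shuffle_epi32(v, 0);              /* pshufd: broadcast low dword */
}
```

The `pshuflw` fills only the low four 16-bit lanes, so the final `pshufd` is needed to replicate that dword across the full 128-bit register.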

Is there something specific you would like to propose to mitigate this, apart from getting rid of a high-value operation? If not, and this is more about highlighting a code-generation issue, I'm not sure anything can actually be done about it, given the different semantics for different bit widths on x64.

@abrown
Contributor Author

abrown commented Feb 21, 2020

This is very specifically an Intel ISA quirk because pshufd/pshufw/pshufb all have different semantics

I don't think the key is actually the different semantics; it's that these instructions can't address scalar registers directly and are forced to MOV or PINSR* first to get the value into a vector register in order to then shuffle. I have been looking around at VBROADCAST, VPERM*, VSHUFF*, etc., but I don't see a way to address a GPR directly as on ARM. I suspect that this is impossible, but perhaps there is some trick I'm not yet aware of.
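To illustrate the point (this is my framing, not code from either engine): on AArch64, `dup v0.4s, w0` splats directly from a general-purpose register in one instruction, while on x64 even the simplest case, an i32x4 splat, needs the GPR-to-XMM transfer first:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Illustrative i32x4.splat lowering: the movd is unavoidable because
   the x64 shuffle instructions cannot read a GPR directly. */
static inline __m128i splat_i32x4(int32_t x) {
    __m128i v = _mm_cvtsi32_si128(x);  /* movd xmm, r32: GPR -> XMM first */
    return _mm_shuffle_epi32(v, 0);    /* pshufd: broadcast lane 0 */
}
```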

@dtig
Member

dtig commented Feb 21, 2020

The different semantics are an issue for the specific i16x8.splat you linked code to, but I agree that the additional mov/pinsr* instruction for splats is harder to get rid of, even for memory operands, because neither instruction loads/inserts into XMM registers from memory. The same applies to the load+splat instructions as well (for pre-AVX* codegen).

@dtig
Member

dtig commented May 20, 2020

There doesn't seem to be anything actionable here, so closing this issue - please reopen if you have suggestions for more we can do here.

@dtig dtig closed this as completed May 20, 2020
@abrown
Contributor Author

abrown commented May 22, 2020

Can I get permissions to re-open this? I think the actionable part is to document the possible lowerings that improve the situation in the "implementor's guide" document (do we have one yet?). Specifically on x86, this high-use instruction can be:

  • reduced to two instructions with MOVD + V[P]BROADCAST* in AVX2
  • reduced to a single instruction with V[P]BROADCAST* in AVX2 when the value to splat can be determined to be from a load operation
  • reduced to a single instruction with V[P]BROADCAST* in various flavors of AVX512
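A sketch of the first bullet, the AVX2 movd + vpbroadcastw form, using intrinsics (the function name and the use of a target attribute rather than a global -mavx2 flag are my choices for illustration; callers must verify AVX2 support at runtime before using it):

```c
#include <immintrin.h>  /* AVX2 intrinsics */
#include <stdint.h>

/* Illustrative AVX2 i16x8.splat lowering: movd + vpbroadcastw.
   Compiled for AVX2 via function attribute; guard calls with a
   runtime check such as __builtin_cpu_supports("avx2"). */
__attribute__((target("avx2")))
static __m128i splat_i16x8_avx2(int16_t x) {
    __m128i v = _mm_cvtsi32_si128((uint16_t)x);  /* movd: GPR -> XMM */
    return _mm_broadcastw_epi16(v);              /* vpbroadcastw: one shuffle */
}
```

With a memory source, `vpbroadcastw m16` folds the load and broadcast into the single instruction of the second bullet.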

@dtig
Member

dtig commented May 22, 2020

Not sure if you need permissions to reopen as the original author of the issue, but reopening. This was previously discussed at a meeting (03/06), and there was an action item for the Intel folks who were discussing this at the meeting to follow up with PRs/issues to decide where this document should live and what form it should take.
