This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Inefficient x64 codegen for splat #191

Open
abrown opened this issue Feb 6, 2020 · 7 comments

Comments

@abrown
Contributor

abrown commented Feb 6, 2020

splat has 2- to 3-instruction lowerings in Cranelift and V8. I believe the "splat all ones" and "splat all zeroes" cases lower to a single instruction on both engines, but it is unfortunate that other splat values incur a multi-instruction overhead, especially since splat would seem to be a high-use instruction.

@tlively
Member

tlively commented Feb 14, 2020

Since splat is a high-use instruction, is there a different semantics that would cover most of its uses and also have better codegen? Or would simplifying the codegen for splats just lead to proportionally more complex user code to regain their current functionality?

@dtig
Member

dtig commented Feb 18, 2020

This is very specifically an Intel ISA quirk, because pshufd/pshufw/pshufb all have different semantics. In the specific case you linked for i16x8.splat, the pshufw instruction only operates on 64-bit operands, not 128-bit operands, so there are a few different ways to synthesize this, but AFAIK they will all need at least two instructions to synthesize the splat. Apart from the V8 implementation linked, I suspect you would be looking at some combination of pshuflw, pshufhw, and/or pshufd, and possibly a move to an XMM register, depending on the engine implementation.
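As a concrete sketch of that combination, an i16x8 splat from a GPR can be written with baseline SSE2 intrinsics in three instructions (the function name is illustrative, not taken from V8 or Cranelift):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Illustrative i16x8.splat lowering: movd + pshuflw + pshufd. */
static inline __m128i splat_i16x8(int16_t x) {
    __m128i v = _mm_cvtsi32_si128((uint16_t)x);  /* movd: GPR -> low XMM lane */
    v = _mm_shufflelo_epi16(v, 0);               /* pshuflw: low 4 words = x  */
    return _mm_shuffle_epi32(v, 0);              /* pshufd: broadcast low dword */
}
```

The `pshuflw` fills only the low four 16-bit lanes, so the final `pshufd` is needed to replicate that dword across the full 128-bit register.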

Is there something specific you would like to propose to mitigate this, apart from getting rid of a high-value operation? If not, and this is more about highlighting a code-generation issue, I'm not sure anything can actually be done about it, given the different semantics for different bit widths on x64.

@abrown
Contributor Author

abrown commented Feb 21, 2020

This is very specifically an Intel ISA quirk because pshufd/pshufw/pshufb all have different semantics

I don't think the key is actually the different semantics; it's that these instructions can't address scalar registers directly and are forced to MOV or PINSR* first to get the value into a vector register in order to then shuffle. I have been looking around at VBROADCAST, VPERM*, VSHUFF*, etc., but I don't see a way to address a GPR directly as on ARM. I suspect that this is impossible, but perhaps there is some trick I'm not yet aware of.
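To illustrate the point (this is my framing, not code from either engine): on AArch64, `dup v0.4s, w0` splats directly from a general-purpose register in one instruction, while on x64 even the simplest case, an i32x4 splat, needs the GPR-to-XMM transfer first:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Illustrative i32x4.splat lowering: the movd is unavoidable because
   the x64 shuffle instructions cannot read a GPR directly. */
static inline __m128i splat_i32x4(int32_t x) {
    __m128i v = _mm_cvtsi32_si128(x);  /* movd xmm, r32: GPR -> XMM first */
    return _mm_shuffle_epi32(v, 0);    /* pshufd: broadcast lane 0 */
}
```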

@dtig
Member

dtig commented Feb 21, 2020

The different semantics are an issue for the specific i16x8.splat you linked code to, but I agree that the additional mov/pinsr* instruction for splats is harder to get rid of, even for memory operands, because neither instruction loads/inserts into XMM registers from memory. The same applies to the load+splat instructions as well (for pre-AVX* codegen).

@dtig
Member

dtig commented May 20, 2020

There doesn't seem to be anything actionable here, so closing this issue - please reopen if you have suggestions for more we can do here.

@dtig dtig closed this as completed May 20, 2020
@abrown
Contributor Author

abrown commented May 22, 2020

Can I get permissions to re-open this? I think the actionable part is to document the possible lowerings that improve the situation in the "implementor's guide" document (do we have one yet?). Specifically on x86, this high-use instruction can be:

  • reduced to two instructions with MOVD + V[P]BROADCAST* in AVX2
  • reduced to a single instruction with V[P]BROADCAST* in AVX2 when the value to splat can be determined to be from a load operation
  • reduced to a single instruction with V[P]BROADCAST* in various flavors of AVX512
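A sketch of the first bullet, the AVX2 movd + vpbroadcastw form, using intrinsics (the function name and the use of a target attribute rather than a global -mavx2 flag are my choices for illustration; callers must verify AVX2 support at runtime before using it):

```c
#include <immintrin.h>  /* AVX2 intrinsics */
#include <stdint.h>

/* Illustrative AVX2 i16x8.splat lowering: movd + vpbroadcastw.
   Compiled for AVX2 via function attribute; guard calls with a
   runtime check such as __builtin_cpu_supports("avx2"). */
__attribute__((target("avx2")))
static __m128i splat_i16x8_avx2(int16_t x) {
    __m128i v = _mm_cvtsi32_si128((uint16_t)x);  /* movd: GPR -> XMM */
    return _mm_broadcastw_epi16(v);              /* vpbroadcastw: one shuffle */
}
```

With a memory source, `vpbroadcastw m16` folds the load and broadcast into the single instruction of the second bullet.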

@dtig
Member

dtig commented May 22, 2020

Not sure if you need permissions to reopen as the original author of the issue, but reopening. This was previously discussed at a meeting (03/06), and there was an action item for the Intel folks who were discussing this at the meeting to follow up with PRs/issues to decide where this document should live and what form it should take.
