Skip to content
This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Inefficient x64 codegen for all_true/any_true #189

Open
abrown opened this issue Feb 6, 2020 · 12 comments
Open

Inefficient x64 codegen for all_true/any_true #189

abrown opened this issue Feb 6, 2020 · 12 comments

Comments

@abrown
Copy link
Contributor

abrown commented Feb 6, 2020

all_true checks if all lanes are (unsigned) greater than 0. This requires 4 instructions in cranelift and 6 in v8. Perhaps there is a more granular way to reduce lanes (see movemask) and avoid this inefficiency?

Along these lines, any_true is 4 instructions in v8 and could be 2 as in cranelift with the use of SETcc.

@abrown abrown changed the title Inefficient x64 codegen for all_true Inefficient x64 codegen for all_true/any_true Feb 6, 2020
pull bot pushed a commit to p-g-krish/v8 that referenced this issue Mar 13, 2020
Based on feedback in WebAssembly/simd#189 and
inspired by cranelift's codegen, we reduce instruction count by 1 for
both types of operations - all_true goes from 6 -> 5, any_true from 4 ->
3. The main transformation is to change a sequence of movq + ptest +
cmovq to ptest + setcc. We unfortunately cannot cut down the instruction
counts further, since we need to zero the destination register.

Change-Id: Idc2540dbec755c7a7ff5069955f74e978190161d
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2100994
Reviewed-by: Deepti Gandluri <[email protected]>
Commit-Queue: Zhi An Ng <[email protected]>
Cr-Commit-Position: refs/heads/master@{#66710}
@ngzhian
Copy link
Member

ngzhian commented Oct 8, 2020

setcc will only set the low byte of the register, how does cranelift deal with this? Is the dst register zero-ed first?

@abrown
Copy link
Contributor Author

abrown commented Oct 8, 2020

Cranelift's SSA form has types for each value; if you try to use an I8 value somewhere as an I64 then it will extend it appropriately. (That is my hazy recollection from when I implemented it a long, long time ago).

@abrown
Copy link
Contributor Author

abrown commented Oct 8, 2020

But looking at the V8 implementation: why not XOR the tmp register as is done currently and use setcc?

@ngzhian
Copy link
Member

ngzhian commented Oct 8, 2020

it will extend it appropriately.

Got it, thanks! So there's one more instruction for extension. In v8 I used xor, which I guess is the same in terms of instruction counts.

I think you're looking at the older version, the latest one uses setcc https://github.com/v8/v8/blob/master/src/compiler/backend/x64/code-generator-x64.cc#L604-L613

I couldn't do the same on IA-32, setcc there requires a byte register, and I don't think we can specify that we want a byte register in our register allocation, can only check it. I looked briefly at cranelift, it looks like it only supports x86-64, so this is not a concern there, right?

@abrown
Copy link
Contributor Author

abrown commented Oct 8, 2020

I wasn't too focused on IA-32. Thanks for the link to the latest; I guess opening some of these issues had some effect after all! I have been waiting for a document to put these implementation notes in... @ngzhian, @tlively, what do you think about adding them to the SIMD document itself but as collapsible sections, like:

Implementation notes:
x86/x86-64 processors with AVX instruction sets

TODO

x86/x86-64 processors with SSE4.1 instruction sets

TODO

ARM64 processors

TODO

That way both the implementors and the users can quickly see lowerings. I would assume users might actually want a hint as to what these instructions will compile down to, even if different runtimes actually emit different things.

@ngzhian
Copy link
Member

ngzhian commented Oct 8, 2020

I guess opening some of these issues had some effect after all!

Definitely, thank you for doing so. A lot of the issues don't have workarounds, that's how the instructions are spec-ed and how we need to implement them.

I don't remember what these "implementation notes" are for? Is it guidance for engine implementors to make sure their implementations are fast?

@abrown
Copy link
Contributor Author

abrown commented Oct 8, 2020

Yeah, that and to alert users that the semantics of certain instructions may have unintended codegen; maybe you use a Wasm SIMD instruction in your performance-critical code and then you realize that it doesn't perform as well as you thought on certain architectures. I think it would be good to be upfront about that rather than having to dig through v8 source code, e.g., as I have had to do.

@ngzhian
Copy link
Member

ngzhian commented Oct 8, 2020

I think it would be good to be upfront about that rather than having to dig through v8 source code, e.g., as I have had to do.

@zeux has this https://github.com/zeux/wasm-simd/ which targets part of it. I think it's an appropriate place for such notes, provided we keep it up to date :) I don't think we should do that work right now, since the instruction set is still in flux. After we finalize (hard freeze), we can probably invest more time into it.

@abrown
Copy link
Contributor Author

abrown commented Oct 8, 2020

Yeah, in previous discussions I've mentioned that I wanted that information to be more visible and official. I'm not opposed to waiting longer, though (less work right now 😄).

@ngzhian
Copy link
Member

ngzhian commented Oct 8, 2020

Yeah, in previous discussions I've mentioned that I wanted that information to be more visible and official.

Ah okay, sorry I've lost track of this. Perhaps during that discussion it was brought up that the spec might not be the right place for this sort of information? Maybe it can be in the Appendix, which is the only place containing implementation-related information.

I'm not opposed to waiting longer

Same, let's leave this as it is for now :)

@abrown
Copy link
Contributor Author

abrown commented Oct 8, 2020

Yeah, that appendix seems like the right type of thing--in the same general place as the specification but not cluttering the semantics.

@Maratyszcza
Copy link
Contributor

Maratyszcza commented Oct 8, 2020

why not XOR the tmp register as is done currently and use setcc?

Partial register updates are very expensive on Intel. It would be more efficient to do setcc r8 + movzx r32, r8, even though it increase latency of the dependency chain.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants