Inefficient x64 codegen for all_true/any_true #189

abrown · 2020-02-06T21:40:42Z

all_true checks if all lanes are (unsigned) greater than 0. This requires 4 instructions in cranelift and 6 in v8. Perhaps there is a more granular way to reduce lanes (see movemask) and avoid this inefficiency?

Along these lines, any_true is 4 instructions in v8 and could be 2 as in cranelift with the use of SETcc.
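To make the reduction concrete, here is a minimal scalar sketch of what the two operations compute, together with a model of the movemask idea mentioned above. It assumes 8-bit lanes (the i8x16 shape), and the function names are made up for illustration. The mask helper imitates a compare-with-zero followed by a one-bit-per-lane collection (as `pcmpeqb` and `pmovmskb` would do); it is not the sequence that either cranelift or v8 emits.

```rust
/// Scalar model of i8x16.any_true: 1 if at least one lane is non-zero.
fn any_true(lanes: [u8; 16]) -> u32 {
    lanes.iter().any(|&lane| lane != 0) as u32
}

/// Scalar model of i8x16.all_true: 1 only if every lane is non-zero.
fn all_true(lanes: [u8; 16]) -> u32 {
    lanes.iter().all(|&lane| lane != 0) as u32
}

/// Model of the movemask idea: bit i of the result is set when lane i is zero.
/// (Hypothetical helper; stands in for a compare-with-zero plus a byte movemask.)
fn zero_lane_mask(lanes: [u8; 16]) -> u16 {
    lanes
        .iter()
        .enumerate()
        .fold(0u16, |mask, (i, &lane)| mask | (((lane == 0) as u16) << i))
}

/// all_true holds when no lane compared equal to zero.
fn all_true_via_mask(lanes: [u8; 16]) -> u32 {
    (zero_lane_mask(lanes) == 0) as u32
}

/// any_true holds unless every lane compared equal to zero.
fn any_true_via_mask(lanes: [u8; 16]) -> u32 {
    (zero_lane_mask(lanes) != 0xFFFF) as u32
}
```

Both reductions end in a single scalar comparison of the mask, which is where a flag-setting instruction followed by SETcc would fit.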


Based on feedback in WebAssembly/simd#189 and inspired by cranelift's codegen, we reduce instruction count by 1 for both types of operations - all_true goes from 6 -> 5, any_true from 4 -> 3. The main transformation is to change a sequence of movq + ptest + cmovq to ptest + setcc. We unfortunately cannot cut down the instruction counts further, since we need to zero the destination register.

Change-Id: Idc2540dbec755c7a7ff5069955f74e978190161d
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2100994
Reviewed-by: Deepti Gandluri <[email protected]>
Commit-Queue: Zhi An Ng <[email protected]>
Cr-Commit-Position: refs/heads/master@{#66710}

ngzhian · 2020-10-08T01:25:28Z

setcc only sets the low byte of the register; how does cranelift deal with this? Is the dst register zeroed first?
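The partial write being asked about can be modelled with plain integers. This is only a sketch of the register contents, with hypothetical function names; it says nothing about what either engine actually emits. It shows why a destination holding stale data has to be cleared (or the result extended) before the value can be used as a full-width integer.

```rust
/// Model of `setcc r8`: writes 0 or 1 into the low byte of a 64-bit register
/// and leaves the upper 56 bits as they were.
fn setcc(reg: u64, condition: bool) -> u64 {
    (reg & !0xFF) | condition as u64
}

/// Model of zeroing the destination first and then applying `setcc`.
fn zero_then_setcc(condition: bool) -> u64 {
    let dst = 0u64; // stands in for `xor dst, dst`
    setcc(dst, condition)
}
```

With a stale destination such as `0xDEAD_BEEF_0000_0042`, `setcc` alone leaves the upper bits in place, so the register does not hold a clean 0 or 1. Since `xor` itself modifies the flags, the zeroing has to happen before the flag-setting instruction rather than between it and `setcc`.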

abrown · 2020-10-08T20:08:44Z

Cranelift's SSA form has types for each value; if you try to use an I8 value somewhere as an I64 then it will extend it appropriately. (That is my hazy recollection from when I implemented it a long, long time ago).

abrown · 2020-10-08T20:10:59Z

But looking at the V8 implementation: why not XOR the tmp register as is done currently and use setcc?

ngzhian · 2020-10-08T20:25:35Z

> it will extend it appropriately.

Got it, thanks! So there's one more instruction for extension. In v8 I used xor, which I guess is the same in terms of instruction counts.

I think you're looking at the older version; the latest one uses setcc: https://github.com/v8/v8/blob/master/src/compiler/backend/x64/code-generator-x64.cc#L604-L613

I couldn't do the same on IA-32: setcc there requires a byte register, and I don't think we can request a byte register in our register allocation; we can only check whether we got one. I looked briefly at cranelift, and it looks like it only supports x86-64, so this is not a concern there, right?

abrown · 2020-10-08T20:49:41Z

I wasn't too focused on IA-32. Thanks for the link to the latest; I guess opening some of these issues had some effect after all! I have been waiting for a document to put these implementation notes in... @ngzhian, @tlively, what do you think about adding them to the SIMD document itself but as collapsible sections, like:

Implementation notes:

x86/x86-64 processors with AVX instruction sets

TODO

x86/x86-64 processors with SSE4.1 instruction sets

TODO

ARM64 processors

TODO

That way both the implementors and the users can quickly see lowerings. I would assume users might actually want a hint as to what these instructions will compile down to, even if different runtimes actually emit different things.

ngzhian · 2020-10-08T21:26:00Z

> I guess opening some of these issues had some effect after all!

Definitely, thank you for doing so. A lot of the issues don't have workarounds; that's just how the instructions are specified and how we need to implement them.

I don't remember what these "implementation notes" are for? Is it guidance for engine implementors to make sure their implementations are fast?

abrown · 2020-10-08T21:46:45Z

Yeah, that and to alert users that the semantics of certain instructions may have unintended codegen; maybe you use a Wasm SIMD instruction in your performance-critical code and then you realize that it doesn't perform as well as you thought on certain architectures. I think it would be good to be upfront about that rather than having to dig through v8 source code, e.g., as I have had to do.

ngzhian · 2020-10-08T22:18:04Z

> I think it would be good to be upfront about that rather than having to dig through v8 source code, e.g., as I have had to do.

@zeux has this https://github.com/zeux/wasm-simd/ which targets part of it. I think it's an appropriate place for such notes, provided we keep it up to date :) I don't think we should do that work right now, since the instruction set is still in flux. After we finalize (hard freeze), we can probably invest more time into it.

abrown · 2020-10-08T22:27:19Z

Yeah, in previous discussions I've mentioned that I wanted that information to be more visible and official. I'm not opposed to waiting longer, though (less work right now 😄).

ngzhian · 2020-10-08T22:31:32Z

> Yeah, in previous discussions I've mentioned that I wanted that information to be more visible and official.

Ah okay, sorry I've lost track of this. Perhaps during that discussion it was brought up that the spec might not be the right place for this sort of information? Maybe it can be in the Appendix, which is the only place containing implementation-related information.

> I'm not opposed to waiting longer

Same, let's leave this as it is for now :)

abrown · 2020-10-08T22:32:59Z

Yeah, that appendix seems like the right type of thing--in the same general place as the specification but not cluttering the semantics.

Maratyszcza · 2020-10-08T22:48:02Z

> why not XOR the tmp register as is done currently and use setcc?

Partial register updates are very expensive on Intel. It would be more efficient to do setcc r8 + movzx r32, r8, even though it increases the latency of the dependency chain.
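As a rough sketch, the two orderings discussed here can be modelled on plain integers to check that they leave the same value in the destination. The function names are hypothetical, and the model captures only register contents; it cannot show the partial-register cost or the extra latency, which are the actual trade-off.

```rust
/// Model of a byte-sized register write: only the low 8 bits change.
fn write_low_byte(reg: u64, byte: u8) -> u64 {
    (reg & !0xFF) | byte as u64
}

/// Sequence A: zero the register first, then `setcc` into its low byte.
fn xor_then_setcc(_old: u64, flag: bool) -> u64 {
    let zeroed = 0u64; // stands in for `xor reg, reg`
    write_low_byte(zeroed, flag as u8)
}

/// Sequence B: `setcc r8` into a register with stale contents, then `movzx r32, r8`.
fn setcc_then_movzx(old: u64, flag: bool) -> u64 {
    let partial = write_low_byte(old, flag as u8);
    // `movzx r32, r8` zero-extends the low byte; on x86-64 a write to a
    // 32-bit register also clears bits 32..63.
    (partial as u8) as u64
}
```

Sequence B adds the `movzx` after the flag is materialized, which is why it lengthens the dependency chain even though the final value is identical.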

abrown changed the title ~~Inefficient x64 codegen for all_true~~ Inefficient x64 codegen for all_true/any_true Feb 6, 2020

jlb6740 mentioned this issue Mar 11, 2020

Add .bitmask instruction family #201

Merged

arunetm mentioned this issue Nov 9, 2020

Agenda for sync meeting 11/13/20 #390

Closed

ngzhian added the perf documentation label Mar 17, 2021
