-
Notifications
You must be signed in to change notification settings - Fork 43
Inefficient x64 codegen for all_true/any_true #189
Comments
Based on feedback in WebAssembly/simd#189 and inspired by cranelift's codegen, we reduce instruction count by 1 for both types of operations - all_true goes from 6 -> 5, any_true from 4 -> 3. The main transformation is to change a sequence of movq + ptest + cmovq to ptest + setcc. We unfortunately cannot cut down the instruction counts further, since we need to zero the destination register. Change-Id: Idc2540dbec755c7a7ff5069955f74e978190161d Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2100994 Reviewed-by: Deepti Gandluri <[email protected]> Commit-Queue: Zhi An Ng <[email protected]> Cr-Commit-Position: refs/heads/master@{#66710}
setcc will only set the low byte of the register, how does cranelift deal with this? Is the dst register zero-ed first? |
Cranelift's SSA form has types for each value; if you try to use an |
But looking at the V8 implementation: why not XOR the tmp register as is done currently and use setcc? |
Got it, thanks! So there's one more instruction for extension. In v8 I used xor, which I guess is the same in terms of instruction counts. I think you're looking at the older version, the latest one uses setcc https://github.com/v8/v8/blob/master/src/compiler/backend/x64/code-generator-x64.cc#L604-L613 I couldn't do the same on IA-32, setcc there requires a byte register, and I don't think we can specify that we want a byte register in our register allocation, can only check it. I looked briefly at cranelift, it looks like it only supports x86-64, so this is not a concern there, right? |
I wasn't too focused on IA-32. Thanks for the link to the latest; I guess opening some of these issues had some effect after all! I have been waiting for a document to put these implementation notes in... @ngzhian, @tlively, what do you think about adding them to the SIMD document itself but as collapsible sections, like: Implementation notes:x86/x86-64 processors with AVX instruction setsTODO x86/x86-64 processors with SSE4.1 instruction setsTODO ARM64 processorsTODO That way both the implementors and the users can quickly see lowerings. I would assume users might actually want a hint as to what these instructions will compile down to, even if different runtimes actually emit different things. |
Definitely, thank you for doing so. A lot of the issues don't have workarounds, that's how the instructions are spec-ed and how we need to implement them. I don't remember what these "implementation notes" are for? Is it guidance for engine implementors to make sure their implementations are fast? |
Yeah, that and to alert users that the semantics of certain instructions may have unintended codegen; maybe you use a Wasm SIMD instruction in your performance-critical code and then you realize that it doesn't perform as well as you thought on certain architectures. I think it would be good to be upfront about that rather than having to dig through v8 source code, e.g., as I have had to do. |
@zeux has this https://github.com/zeux/wasm-simd/ which targets part of it. I think it's an appropriate place for such notes, provided we keep it up to date :) I don't think we should do that work right now, since the instruction set is still in flux. After we finalize (hard freeze), we can probably invest more time into it. |
Yeah, in previous discussions I've mentioned that I wanted that information to be more visible and official. I'm not opposed to waiting longer, though (less work right now 😄). |
Ah okay, sorry I've lost track of this. Perhaps during that discussion it was brought up that the spec might not be the right place for this sort of information? Maybe it can be in the Appendix, which is the only place containing implementation-related information.
Same, let's leave this as it is for now :) |
Yeah, that appendix seems like the right type of thing--in the same general place as the specification but not cluttering the semantics. |
Partial register updates are very expensive on Intel. It would be more efficient to do |
all_true
checks if all lanes are (unsigned) greater than 0. This requires 4 instructions in cranelift and 6 in v8. Perhaps there is a more granular way to reduce lanes (see movemask) and avoid this inefficiency?Along these lines,
any_true
is 4 instructions in v8 and could be 2 as in cranelift with the use ofSETcc
.The text was updated successfully, but these errors were encountered: