Inefficient x64 codegen for bitselect #192
I guess the bit-select semantics were chosen because ARM does it that way? The input to bitselect should be a "mask" whose lanes are either all-zero or all-one bits. That's typically true anyway (the result of a comparison instruction), and anyone who has a high-bit-only mask (0x8..0) can (arithmetically) right-shift to get the all-bits mask. Given that requirement, x86 blendv does the right thing - checking the upper bit is sufficient. The operation can lower to a single instruction.
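(Editorial sketch, not from the thread: the two lowerings in SSE intrinsics, assuming byte granularity. bitselect computes (v1 & mask) | (v2 & ~mask); PBLENDVB reads only the top bit of each mask byte, so the two agree exactly when every mask byte is 0x00 or 0xFF.)

```c
#include <smmintrin.h>  // SSE4.1 (pulls in SSE2 as well)

// Generic lowering: correct for any mask, three logic ops.
static __m128i bitselect_generic(__m128i v1, __m128i v2, __m128i mask) {
    return _mm_or_si128(_mm_and_si128(v1, mask),      // v1 bits where mask=1
                        _mm_andnot_si128(mask, v2));  // v2 bits where mask=0
}

// Single-instruction lowering, valid only when each mask byte is 0x00/0xFF:
// PBLENDVB picks v1 wherever the mask byte's top bit is set.
static __m128i bitselect_allbits_mask(__m128i v1, __m128i v2, __m128i mask) {
    return _mm_blendv_epi8(v2, v1, mask);
}
```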
@jan-wassenberg If I understand correctly, the proposed semantics is the same as the current bitselect semantics except that it requires that all mask bits for a particular lane have the same value. Is that correct? If so, would a malformed mask result in a trap? I expect having to validate the mask would lead to similarly suboptimal codegen.
@tlively Right. I agree validating inside bitselect would be expensive. Is undefined behavior generally unacceptable? If so, we could define a Mask type guaranteed by the compiler to have that property. Comparisons would return that type, and we could provide mask<=>simd128 conversions that do validation.
Let's see if I understand: […]
Yes, WebAssembly cannot have undefined or implementation-defined behavior. Introducing a new mask type could solve the problem, but it would be a large amount of complexity to add to the current proposal. I propose that we keep the current bitselect instruction and defer discussion of a more complex version to a post-MVP SIMD proposal.
@abrown Yes, that's what I was thinking. Valid could mean all_true((mask == 0) | mask). @tlively I totally understand that this would be a large change with relatively low ROI. It's still worth keeping in mind that masks would ideally be a separate type if this is ever extended to longer vectors such as AVX-512. (There, masks are single-bit, and it would be inefficient to expand them to vectors after every comparison.)
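(A sketch of that validity predicate at byte granularity, assuming SSE2 intrinsics; the function name is mine, not proposed spec text. A lane is valid iff it is 0x00 or 0xFF, so OR-ing it with its own ==0 comparison must produce all ones.)

```c
#include <emmintrin.h>  // SSE2

// all_true((mask == 0) | mask): returns 1 iff every byte is 0x00 or 0xFF.
static int mask_is_canonical(__m128i mask) {
    __m128i eq0 = _mm_cmpeq_epi8(mask, _mm_setzero_si128());
    __m128i ok  = _mm_or_si128(eq0, mask);
    // "all_true" here means every bit set, i.e. every byte equals 0xFF.
    return _mm_movemask_epi8(_mm_cmpeq_epi8(ok, _mm_set1_epi8(-1))) == 0xFFFF;
}
```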
I totally agree that having a separate mask type would make sense in a follow-up SIMD proposal 👍
I understand the complexity argument but I think this would be far easier to do now than later. Once the current semantics for bitselect […]
Because it would be a lot of work and we are trying to get SIMD into the hands of our users soon. Creating a new bitselect instruction in the future will be fine. A precedent for that kind of addition is the nontrapping float-to-int instructions, which were added post-MVP when we realized the MVP trapping float-to-int instructions were not as useful. To be clear, I like this idea and would be happy to do it in the future, but I don't think we should delay shipping MVP SIMD to make this change.
@abrown Having different semantics here that can map to one instruction for bitselect would mean either introducing a new separate mask type or validating the mask. The code sequence here for x64 is fairly straightforward, and in both cases uses 1-cycle instructions (we could optimize V8 codegen here to get rid of the […])
Very often the mask in bitselect comes from a comparison instruction. In this case the x64 code generator can elide the extra instructions and just use blendv directly. On SKX blendv is 1 uop with a throughput of 1, vs. the and/andnot/or sequence that is 3 uops but where each insn has a throughput of 3/clock, so it's unclear to me which one is faster. Could be worth an experiment on the engine side though, or is it too much work to trace the "origin" of the value that feeds into bitselect? This is loosely equivalent to a separate mask type, but doesn't require spec changes.
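(A loose sketch of such origin tracing; every type and name below is invented for illustration and is not V8's or cranelift's actual IR.)

```c
typedef enum { OP_I8X16_EQ, OP_I8X16_LT_S, OP_V128_CONST, OP_OTHER } Opcode;
typedef struct Node { Opcode op; } Node;

// Comparison results have each lane all-zeros or all-ones, so a bitselect
// whose mask comes straight from a compare can lower to a single pblendvb.
static int mask_is_comparison_result(const Node *mask) {
    return mask->op == OP_I8X16_EQ || mask->op == OP_I8X16_LT_S;
}

static void emit_bitselect(const Node *mask) {
    if (mask_is_comparison_result(mask)) {
        /* emit pblendvb */
    } else {
        /* emit pand / pandn / por */
    }
}
```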
Even if the instructions dispatch in the same cycle with and/andnot/or, they will not retire in the same cycle due to the dependencies. Depending on the surrounding code, it's possible that they will reorder and the extra latency will not be measurable, but they do end up using more registers as well, and they do make the function longer. Both of these factors can impact whether a function will be inlined or not. And/andnot/or can often execute on more pipes than blendv, although that isn't always the case. For example, on Ryzen logical operations can use pipe 0, 1, 2, or 3 while blendv can use pipe 0 or 1. I think any benchmark we can cook up to measure this will be synthetic in nature and might not reflect the real world. There might not be an always-optimal choice. IMO either way is fine, but if I had to pick one, I would opt for blendv. It's what I use in my math lib as well.
@zeux In V8, we could trace the origin, but most Wasm engines are by design simpler than that, i.e. the operations are expected to map directly to code generation without the need for more optimizations. Though we could argue that since we pattern-match for shuffles, we could justify adding more optimizations there. I'm just not sure this particular case warrants that optimization. @nfrechette Synthetic benchmarks don't represent real-world data; what I was proposing is that if there is a need to change the semantics, a concrete proposal to prototype and benchmark is required. I do agree that performance here is very application-dependent, and neither option significantly outweighs the other, so I'm unwilling to put too much effort into this particular operation.
@dtig I guess my point was that I don't think we need to change the wasm spec here since the opportunity for better codegen is left open. If an engine does any sort of stack->SSA conversion then this tracking is probably simple; if an engine doesn't do that it's going to miss out on a lot of other opportunities such as shifts.
@zeux I agree that we don't need a spec change here, but not necessarily for the same reasons. My earlier comment was trying to clarify that there were some speculative spec changes discussed earlier with different performance tradeoffs, and none of those seemed significantly better than the others. It was more of a suggestion for @abrown to have a concrete proposal so we can discuss actual tradeoffs or prototype if necessary. Re. tracking, I think that we could do it in the future as an additional optimization, but wanted to point out that for wasm in general it's been an explicit design choice not to do so, and most engines make the same choice.
I do not believe the current shift situation is truly viable without tracking constantness of arguments. On both ARM and x64, shifts take a lot of instructions, up to 9-10, if the engine doesn't know the counts are constants. Does this mean we need to introduce constant versions of all shifts, or is that different? Raising this because I was under the impression that all engines will do this, but I am unsure now.
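(To make the shift point concrete, an editorial sketch of the usual i8x16.shl emulation on x64, assuming SSE2; engines differ in the details. x64 has no per-byte shift, so the lowering shifts 16-bit lanes and masks off the bits that leak across byte boundaries; with a constant count the mask folds to a constant, while a runtime count must be materialized.)

```c
#include <emmintrin.h>  // SSE2

static __m128i i8x16_shl(__m128i v, int count) {
    count &= 7;  // wasm shift counts wrap modulo the lane width
    __m128i sh = _mm_sll_epi16(v, _mm_cvtsi32_si128(count));
    // Clear the bits shifted in from the neighbouring byte. For a constant
    // count this is one constant load; for a runtime count, _mm_set1_epi8
    // expands to extra scalar ops plus a splat.
    __m128i m = _mm_set1_epi8((char)(0xFFu << count));
    return _mm_and_si128(sh, m);
}
```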
I think I misunderstood the point you're trying to make with tracking the origin; tracking constantness and tracking the origin mean different things to me. Deducing that the origin is a constant and optimizing on constantness is a necessary optimization that most engines will do (32/64-bit constants are obviously an easier case). Separately tracing the origin of the value that feeds into the bitselect (also possible when the origin is a compare) and merging codegen for the two is what I meant in my previous comments by an additional optimization.
I guess I thought these are the same because for both you need to identify the instruction that pushed the value to the stack slot: in one case it's a constant integer, and in the other it's a comparison instruction. Which is why I thought it's reasonable to expect to be able to do both, or neither.
@dtig, I don't feel there is much of a need to make Jan's proposal above more concrete; after the clarifications above, it is clear enough to me what would need to happen for us to make this instruction more performant. What isn't clear to me is how these semantics were chosen: is it really, as mentioned above, "because ARM does it that way"? I think that's the type of thing that needs to be made more concrete. @zeux, I added a cranelift issue along the lines of what you proposed. Thanks! I think that is the way forward in the short term, and I can piece together a vision of how that could work in the current cranelift infrastructure. I don't think we should completely abandon @jan-wassenberg's mask idea (it's a great idea, BTW!) because it will be necessary for applying this optimization through function calls. And @jan-wassenberg, do you see this "new type" approach being applicable anywhere else?
@zeux I think you could do both; it's a higher-level optimization that we disable for Wasm codegen because in general Wasm operations should map very closely to native instructions. Constants, at least in V8, are special, and that knowledge is built into input nodes at all levels - so deducing that a value is a constant is easier than deducing which operation generates it. This is more an engine implementation detail, but I see how they can both be interchangeable. @abrown The original semantics came from the time when the SIMD.js proposal was transitioned over to the one for SIMD in Wasm, and this was before my time. My speculative answer though is that these were chosen as a good starting point, with application feedback expected to influence tweaking the semantics. AFAIK, this hasn't been problematic for applications, with the prototype implementation in V8 being in use for some time. If you are looking for more specific reasons for the initial semantics, I would redirect you to @PeterJensen and @arunetm who were involved in the SIMD.js specification.
Thanks @abrown :) I suppose a type could also be useful for swizzle() inputs that are known to be >= 0x80, which would allow using PSHUFB directly. Probably small potatoes, though - much less common/important than 0xFF..FF masks.
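(For context, a sketch of the usual Wasm swizzle lowering on x64 with SSSE3 intrinsics; my code, not an engine's. Wasm swizzle must return 0 for any index >= 16, but PSHUFB zeroes a byte only when the index's top bit is set, hence the saturating add; an index guarantee like the one above would let PSHUFB be used directly.)

```c
#include <tmmintrin.h>  // SSSE3

// General lowering: saturate so any index >= 16 ends up >= 0x80,
// which PSHUFB turns into a zero byte.
static __m128i swizzle_general(__m128i v, __m128i idx) {
    __m128i sat = _mm_adds_epu8(idx, _mm_set1_epi8(0x70));
    return _mm_shuffle_epi8(v, sat);
}

// Direct lowering, valid only if indices are known to be < 16 or >= 0x80.
static __m128i swizzle_canonical_idx(__m128i v, __m128i idx) {
    return _mm_shuffle_epi8(v, idx);
}
```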
What about NaNs? I'm still trying to think this through and it's still very hazy, but would a new type guaranteeing that a float isn't a NaN be possible? If I saw that type I could probably avoid some extra instructions.
Brief summary of the history of bitselect: SIMD.js started with a bit-select, later added an element-wise select, and later removed bit-select to simplify the initial proposal (with the expectation of re-adding it in the future). With the switch to Wasm SIMD, SIMD.js' types were removed in favor of a single v128 type. @abrown Guaranteeing that a float isn't a NaN is hard, because every arithmetic operator has a case where a NaN is produced from non-NaN inputs.
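(The thread's examples were lost in the capture; these are standard IEEE-754 cases, mine, showing how each arithmetic operator can produce NaN from non-NaN operands.)

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double inf = INFINITY;
    printf("%f\n", 0.0 / 0.0);   // division:       0/0       -> NaN
    printf("%f\n", inf - inf);   // subtraction:    inf - inf -> NaN
    printf("%f\n", 0.0 * inf);   // multiplication: 0 * inf   -> NaN
    printf("%f\n", sqrt(-1.0));  // sqrt:           sqrt(-1)  -> NaN
    return 0;
}
```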
@abrown Is there more that you were looking for here, or can this be closed?
I think the closure for this should be documentation in an "implementation guide." |
bitselect is a 3-instruction lowering in cranelift and a 4-instruction lowering in v8.