Inefficient x64 codegen for fmin/fmax #186

abrown · 2020-02-06T19:44:02Z

In attempting to implement fmin and fmax, I observed that the semantics of these instructions prevents a single instruction lowering on x64. V8 has a 9-instruction lowering for F32x4Min, for example, and the other min/max implementations for F32x4/F64x2 are not better.

Also, I noticed that the V8 implementation quiets and clears the NaN payload; this behavior does not seem to be specified in the spec but I suspect that it is necessary for passing the spec tests. Is this correct?

The text was updated successfully, but these errors were encountered:

ngzhian · 2020-02-06T20:59:47Z

I think the key point from the proposal is "They are lane-wise versions of the existing scalar WebAssembly operations." This requires it to follow the semantics of scalar fmin.

abrown · 2020-02-06T21:14:04Z

I think the key point from the proposal is "They are lane-wise versions of the existing scalar WebAssembly operations."

I guess we could discuss if that has to be the case. Is that the reason behind quieting and clearing the NaN payload?

penzn · 2020-02-06T22:47:40Z

Yes, wasm propagates NaNs, that's why they have to be of "quiet" type.

dtig · 2020-02-18T19:44:23Z

This has been discussed a few times in various contexts, and there isn't much that can be done here that's actionable. It may be possible to whittle down the V8 codegen some more (if you know of something specific here, please file a V8 issue so we can address it), but the conclusion of most previous discussions here has been that we need fmin/fmax operations, and at least for the MVP SIMD, these have to be deterministic, and follow the same semantics of the scalar operations.

See also the discussion for Pseudo Min/Max operations in #122, as a potential follow up to this particular issue. I'm closing this for now as your question seems to be answered, feel free to reopen if there's still something outstanding to address that hasn't already been addressed here or in #122.

abrown · 2020-02-21T19:51:56Z

@penzn, could VFIXUPIMM* be used here to do this quieting/clearing?

abrown · 2020-02-21T21:01:41Z

@dtig, could we keep this open or tag this in some way so that we keep track of these inefficiencies? @rrwinterton had proposed an optimization guide for documenting pitfalls exactly like these--alternately we could just start recording issues like this one there?

lars-t-hansen · 2021-02-03T10:23:22Z

I agree with @abrown, we should reopen this since it fits in with the set of about 10 other issues we have on related topics.

I don't know if fmin/fmax with one constant operand is typical in code, but it does lend itself to better code and may be worth specializing.

penzn · 2021-10-23T00:46:55Z

After looking into this a little more, I think there is a bigger problem than NaNs, specifically comparisons of -0.0 to +0.0. There are known ways to deal with (consistently ignore or propagate) NaN values, but less so for 'sign of zero' situations. From what I can tell, this is taken from JS spec, while native standards often define sign of zero coming from min/max as implementation-dependent. I am not sure what is the practical use in enforcing sign of zero, since from equality point of view it is the same value.

Maratyszcza · 2021-10-23T05:51:48Z

VRANGEPS/VRANGEPD from AVX512 might be helpful, as their handling of signed zeroes is consistent with WAsm instructions.

penzn · 2021-10-26T00:52:26Z

Yeah, that is on my list as well - trying to investigate better min/max sequences, using AVX512.

penzn · 2021-12-07T03:00:30Z

(the following is an excerpt from meeting notes WebAssembly/relaxed-simd#47)

Floating-point min and max operations in WebAssembly have the following
constraints:

Signed zero handling (ex.: min(0.0, -0.0) == -0.0)
NaN propagation (if either input is NaN output is NaN)
NaN canonicalization

This makes x86-based implementations somewhat complicated, because SSE and AVX
instructions don't honor these properties. Solutions usually include:

cmpunordps to detect NaN inputs
and or or to handle signed zeros
Performing operation twice to detect asymmetry
Various forms of blending to put the result back together

The problem is that mitigations for the constraints above interact with each
other, making overall sequence more complicated. Removing even one of the
constraints will make the sequence much simpler. Using more registers may help
as well.

With AVX512:

vrange has the desired behavior v.r.t signed zeros
vfixupimm detects and corrects special values, but requires extra movs
and registers to set up fixup table
Masked operations

We have tried vrange and vrange with conditional lane operations, on a
MobileNet demo:

Both get close to theoretical maxiumum performance
vrange combined with conditional ops wins

Implementation challenges:

AVX512 detection and assembler changes
- cpuid, encoding, tests, etc
For conditional operations, mask register representation
- Possible to hardcode if this is the only use

dtig closed this as completed Feb 18, 2020

lars-t-hansen reopened this Feb 3, 2021

ngzhian added the perf documentation label Mar 17, 2021

penzn mentioned this issue Dec 2, 2021

SIMD subgroup meeting on 2021-12-03 WebAssembly/relaxed-simd#47

Closed

keithw mentioned this issue May 2, 2023

Pass WebAssembly testsuite simd-everywhere/simde#1010

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inefficient x64 codegen for fmin/fmax #186

Inefficient x64 codegen for fmin/fmax #186

abrown commented Feb 6, 2020

ngzhian commented Feb 6, 2020

abrown commented Feb 6, 2020

penzn commented Feb 6, 2020

dtig commented Feb 18, 2020

abrown commented Feb 21, 2020

abrown commented Feb 21, 2020

lars-t-hansen commented Feb 3, 2021

penzn commented Oct 23, 2021

Maratyszcza commented Oct 23, 2021

penzn commented Oct 26, 2021

penzn commented Dec 7, 2021

Inefficient x64 codegen for fmin/fmax #186

Inefficient x64 codegen for fmin/fmax #186

Comments

abrown commented Feb 6, 2020

ngzhian commented Feb 6, 2020

abrown commented Feb 6, 2020

penzn commented Feb 6, 2020

dtig commented Feb 18, 2020

abrown commented Feb 21, 2020

abrown commented Feb 21, 2020

lars-t-hansen commented Feb 3, 2021

penzn commented Oct 23, 2021

Maratyszcza commented Oct 23, 2021

penzn commented Oct 26, 2021

penzn commented Dec 7, 2021