relaxed i32x4.trunc_sat_f32x4_{s,u} i32x4.trunc_sat_f64x2_{s,u}_zero #21

Open
ngzhian opened this issue Apr 16, 2021 · 20 comments
Labels
in-overview Instruction has been added to Overview.md instruction-proposal

Comments

@ngzhian
Member

ngzhian commented Apr 16, 2021

  1. What are the instructions being proposed?

Relaxed versions of:

  • i32x4.trunc_sat_f32x4_s
  • i32x4.trunc_sat_f32x4_u
  • i32x4.trunc_sat_f64x2_s_zero
  • i32x4.trunc_sat_f64x2_u_zero

from Simd128. (Names undecided)

  2. What are the semantics of these instructions?

Convert f32x4/f64x2 to i32x4 with truncation (signed/unsigned). If the inputs are out of range or NaNs, the result is implementation-defined.

  3. How will these instructions be implemented? Give examples for at least
    x86-64 and ARM64. Also provide reference implementation in terms of 128-bit
    Wasm SIMD.

x86/64

relaxed i32x4.trunc_sat_f32x4_s = CVTTPS2DQ
relaxed i32x4.trunc_sat_f32x4_u = VCVTTPS2UDQ (AVX512), Simd128 i32x4.trunc_sat_f32x4_u otherwise (can be slightly optimized to ignore NaNs)
relaxed i32x4.trunc_sat_f64x2_s_zero = CVTTPD2DQ
relaxed i32x4.trunc_sat_f64x2_u_zero = VCVTTPD2UDQ (AVX512), Simd128 i32x4.trunc_sat_f64x2_u_zero

ARM64

relaxed i32x4.trunc_sat_f32x4_s = FCVTZS
relaxed i32x4.trunc_sat_f32x4_u = FCVTZU
relaxed i32x4.trunc_sat_f64x2_s_zero = FCVTZS + SQXTN
relaxed i32x4.trunc_sat_f64x2_u_zero = FCVTZU + UQXTN

ARM NEON

relaxed i32x4.trunc_sat_f32x4_s = vcvt.S32.F32
relaxed i32x4.trunc_sat_f32x4_u = vcvt.U32.F32
relaxed i32x4.trunc_sat_f64x2_s_zero = vcvt.S32.F64 + vcvt.S32.F64 + vmov
relaxed i32x4.trunc_sat_f64x2_u_zero = vcvt.U32.F64 + vcvt.U32.F64 + vmov

Note: On ARM MVE, double precision conversions require the Armv8-M Floating-point Extension (FPv5); MVE can be implemented with or without this extension.

simd128

respective non-relaxed versions i32x4.trunc_sat_f32x4_s, i32x4.trunc_sat_f32x4_u, i32x4.trunc_sat_f64x2_s_zero, i32x4.trunc_sat_f64x2_u_zero.

  4. How does behavior differ across processors? What new fingerprinting surfaces will be exposed?

For i32x4.trunc_sat_f32x4_s:

  • x86/64 will return 0x80000000 in lanes for out of range or NaNs
  • ARM/ARM64 will return 0 for NaNs and saturated results of out of range

For i32x4.trunc_sat_f32x4_u:

  • x86/64 will return 0xFFFFFFFF in lanes for out of range or NaNs if AVX512 is available, 0 otherwise (but requiring more instructions)
  • ARM/ARM64 will return 0 for NaNs and saturated results of out of range

For i32x4.trunc_sat_f64x2_s_zero:

  • x86/64, 0x80000000 for out of range or NaNs
  • ARM/ARM64 will return 0 for NaNs and saturated results of out of range

For i32x4.trunc_sat_f64x2_u_zero:

  • x86/64, 0xFFFFFFFF for out of range or NaNs if AVX512 is available, 0 otherwise
  • ARM/ARM64 will return 0 for NaNs and saturated results of out of range
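
The differences listed above for i32x4.trunc_sat_f32x4_s can be sketched as a scalar model of a single lane. This is an illustration only, not any engine's code: the function names are hypothetical, Python floats are doubles rather than f32, and the out-of-range behaviours are simply the ones stated in this issue (0x80000000 on x86/64; 0 for NaN and saturation on ARM/ARM64).

```python
import math

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def trunc_s_lane_x86(x: float) -> int:
    """Hypothetical model of one CVTTPS2DQ lane.

    NaN and out-of-range inputs yield the bit pattern 0x80000000,
    which is INT32_MIN when read as a signed integer.
    """
    if math.isnan(x) or x >= 2.0**31 or x < -(2.0**31):
        return INT32_MIN
    return math.trunc(x)


def trunc_s_lane_arm(x: float) -> int:
    """Hypothetical model of one FCVTZS lane: NaN gives 0, out of range saturates."""
    if math.isnan(x):
        return 0
    if x >= 2.0**31:
        return INT32_MAX
    if x < -(2.0**31):
        return INT32_MIN
    return math.trunc(x)


# In-range inputs agree on both models; NaN and large inputs do not.
for value in (1.9, -1.9, float("nan"), 3e9, -3e9):
    print(value, trunc_s_lane_x86(value), trunc_s_lane_arm(value))
```

Only the NaN and positive out-of-range cases differ between the two models, which is the fingerprinting surface described above.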

  5. What use cases are there?

Conversion instructions are common. If the application can guarantee the input range, we can get good performance on all architectures.

@Maratyszcza
Collaborator

IIRC @zeux had a use-case for these instructions.

It would be useful to consider f64x2 variants in the same proposal.

@Maratyszcza
Collaborator

For i32x4.trunc_sat_f32x4_u, it will depend on implementation choice on x86/64:

  • if AVX512 is available, same as above, x86/64 will return 0x8000000 in lanes for out of range or NaNs, ARM/ARM64 will return 0

AVX512 version returns 0xFFFFFFFF

@ngzhian
Member Author

ngzhian commented Apr 19, 2021

AVX512 version returns 0xFFFFFFFF

Corrected, thanks!

@ngzhian
Member Author

ngzhian commented Apr 19, 2021

It would be useful to consider f64x2 variants in the same proposal.

relaxed i64x2.trunc_sat_f64x2_{s,u}? We don't have these instructions in Simd128, so I think it is neater to separate them out.

@Maratyszcza
Collaborator

relaxed i64x2.trunc_sat_f64x2_{s,u}? We don't have these instructions in Simd128, so I think it is neater to separate them out.

The WebAssembly/simd#383 instructions

@ngzhian
Member Author

ngzhian commented Apr 19, 2021

i32x4.trunc_sat_f64x2_u_zero and i32x4.trunc_sat_f64x2_s_zero?

@Maratyszcza
Collaborator

Yes

@zeux

zeux commented Apr 19, 2021

Yeah, this one is pretty fundamental for many workflows. E.g. in rendering domains it's common to store data as fixed-point integers for GPU consumption, but to prepare this data you do some math in floating point and then convert to integer via something like int(v * 65535.0f + 0.5f) (assuming the value is known to be positive); the float->int truncation can be pretty hot, depending on the amount of other computation.

It would be nice to also include the rounding variants (on x64 assuming default rounding mode setup you can use cvtps2dq for rounding conversion and cvttps2dq for truncating; unsure what floating point environment is typically used in browser context, if it's undefined then rounding would require vroundps before cvttps).
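
A minimal sketch of the two conversion styles mentioned in this comment, for illustration only. The helper names and the `scale` parameter are hypothetical, and Python computes in double precision rather than f32, so results can differ from a real f32 pipeline in the last bit. `quantize_trunc` mirrors the `int(v * 65535.0f + 0.5f)` pattern built on a truncating conversion; `quantize_nearest` mirrors a rounding conversion under the default round-to-nearest-even mode, which is what Python's `round` also does on ties.

```python
def quantize_trunc(v: float, scale: float = 65535.0) -> int:
    """Add 0.5, then truncate toward zero (valid for non-negative v)."""
    return int(v * scale + 0.5)


def quantize_nearest(v: float, scale: float = 65535.0) -> int:
    """Round to nearest, ties to even, like a rounding conversion."""
    return round(v * scale)


# The two agree except on exact ties, where add-and-truncate rounds up
# and round-to-nearest-even picks the even neighbour.
print(quantize_trunc(1.0), quantize_nearest(1.0))
print(quantize_trunc(2.5, scale=1.0), quantize_nearest(2.5, scale=1.0))
```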

@ngzhian ngzhian changed the title from "relaxed i32x4.trunc_sat_f32x4_{s,u}" to "relaxed i32x4.trunc_sat_f32x4_{s,u} i32x4.trunc_sat_f64x2_{s,u}_zero" Apr 19, 2021
ngzhian added a commit to ngzhian/relaxed-simd that referenced this issue Jun 10, 2021
ngzhian added a commit that referenced this issue Jun 24, 2021
@yurydelendik
Contributor

What will be the exact recipe for relaxed i32x4.trunc_sat_f32x4_u on x86/64 without AVX512? The comment at #247 suggests a somewhat long version.

Is the following acceptable, or does a shorter version exist?

  • y = relaxed i32x4.trunc_sat_f32x4_u(x) is lowered to:
    • MOVAPD xmm_y, xmm_x
    • MOVAPD xmm_tmp, [wasm_i32x4_splat(0x4f000000)]
    • CMPLTPS xmm_tmp, xmm_x
    • PAND xmm_tmp, xmm_x
    • PXOR xmm_y, xmm_tmp
    • CVTTPS2DQ xmm_y, xmm_y
    • PSLLD xmm_tmp, 7
    • PADDD xmm_y, xmm_tmp

@ngzhian
Member Author

ngzhian commented Sep 28, 2021

It will be CVTTPS2DQ. The relaxed version only guarantees the output when inputs are < INT32_MAX and not NaN, which is exactly what CVTTPS2DQ provides, and CVTTPS2DQ has been available since SSE2.

@Maratyszcza
Collaborator

@ngzhian The question was about the unsigned version, and IIUC we don't expect the unsigned version to use just CVTTPS2DQ alone.

@ngzhian
Member Author

ngzhian commented Sep 28, 2021

Oh oops, sorry I missed that. Hm, then we should reconsider if we want the unsigned version in this. AVX512 is not supported by V8 yet.

@Maratyszcza
Collaborator

IMO it is worth having the unsigned version, both for symmetry and because it is still faster on SSE4.1 than the non-relaxed unsigned version.

@zeux

zeux commented Oct 1, 2021

Should these instructions have _sat in the name? In the SIMD MVP _sat stands for saturating, but these instructions don't specify exact behavior for out of range inputs.

@ngzhian
Member Author

ngzhian commented Oct 14, 2021

What is PSLLD xmm_tmp, 7 for? I think it doesn't work for all cases. Consider the input 2147483904.0: this is larger than INT32_MAX but fits in UINT32, so the result should be 2147483904, or 0x80000100.
The hex representation of 2147483904.0 is 0x4f000001 (https://float.exposed/0x4f000001), and if we shift that left by 7 it becomes 0x80000080, which is wrong.
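
The arithmetic in this counterexample can be checked with a short sketch. The helper name `f32_bits` is hypothetical; it only reinterprets a value's IEEE 754 single-precision encoding as an unsigned 32-bit integer using the standard `struct` module.

```python
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of x as a uint32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


bits = f32_bits(2147483904.0)          # 2**31 + 256, exactly representable in f32
shifted = (bits << 7) & 0xFFFFFFFF     # what PSLLD by 7 leaves in a 32-bit lane
expected = 2147483904                  # the correct unsigned conversion result

print(hex(bits), hex(shifted), hex(expected))
```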

@yurydelendik
Contributor

yurydelendik commented Oct 15, 2021

Agreed, there was a mistake 😞 One more operation is needed to make PSLLD work: ADDPS xmm_tmp, xmm_tmp ; PSLLD xmm_tmp, 8.

@ngzhian
Member Author

ngzhian commented Oct 15, 2021

Agreed, there was a mistake 😞 One more operation is needed to make PSLLD work: ADDPS xmm_tmp, xmm_tmp ; PSLLD xmm_tmp, 8.

Perfect, what a neat trick :) thanks!
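
To see why the extra ADDPS makes the shift work, here is a scalar sketch of one lane of the sequence proposed earlier in this thread, with the ADDPS + PSLLD 8 correction applied. It is a model under assumptions, not engine code: all function names are hypothetical, and it is only meant for inputs in [0, 2^32), the range where the relaxed instruction has a defined result. Doubling the value bumps the exponent so that its lowest bit lands at bit 23; shifting left by 8 then discards the rest of the exponent and leaves 2^31 plus the mantissa scaled by 256.

```python
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of x as an IEEE 754 single, returned as a uint32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b: int) -> float:
    """Value of the IEEE 754 single whose bit pattern is the uint32 b."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def cvttps2dq_lane(b: int) -> int:
    """One CVTTPS2DQ lane: truncate, or 0x80000000 for NaN / out of range."""
    x = bits_f32(b)
    if x != x or x >= 2.0**31 or x < -(2.0**31):
        return 0x80000000
    return int(x) & 0xFFFFFFFF


def relaxed_trunc_u_lane(x: float) -> int:
    """One lane of the corrected sequence, for 0 <= x < 2**32 only."""
    x_bits = f32_bits(x)
    limit = bits_f32(0x4F000000)               # 2**31 as f32
    mask = 0xFFFFFFFF if limit < x else 0      # CMPLTPS tmp, x
    tmp = mask & x_bits                        # PAND tmp, x
    y = cvttps2dq_lane(x_bits ^ tmp)           # PXOR y, tmp ; CVTTPS2DQ y, y
    tmp = f32_bits(bits_f32(tmp) * 2.0)        # ADDPS tmp, tmp
    tmp = (tmp << 8) & 0xFFFFFFFF              # PSLLD tmp, 8
    return (y + tmp) & 0xFFFFFFFF              # PADDD y, tmp


for value in (3.7, 2147483648.0, 2147483904.0, 4294967040.0):
    print(value, hex(relaxed_trunc_u_lane(value)))
```

In this model an input of exactly 2^31 is not caught by the strict compare, but the 0x80000000 returned by CVTTPS2DQ for that lane happens to be the right unsigned value.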

@ngzhian
Member Author

ngzhian commented Nov 1, 2021

Note: RISC-V V saturates for same-width conversions. For f64x2->i32x4 it changes the vector type, and I think there's no guarantee that the top lanes are zeroed.

@ngzhian
Member Author

ngzhian commented Nov 1, 2021

On PowerPC VSX, xscvdpsxws and xscvdpuxds perform trunc sat.

@ngzhian ngzhian added the in-overview Instruction has been added to Overview.md label Feb 18, 2022
@ngzhian
Member Author

ngzhian commented Mar 14, 2022

I think I got the out-of-range results wrong in this description: ARM/ARM64 doesn't return 0, it saturates.

kangwoosukeq pushed a commit to prosyslab/v8 that referenced this issue Apr 28, 2022
Codegen details detailed in the relevant github issue.
WebAssembly/relaxed-simd#21

Bug: v8:12284
Change-Id: I06c8859035abae775269bdf949ff0f1c2e262859
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/3508560
Reviewed-by: Adam Klein
Commit-Queue: Deepti Gandluri
Cr-Commit-Position: refs/heads/main@{#79410}