This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Inefficient x64 codegen for conversion instructions #190

Open
abrown opened this issue Feb 6, 2020 · 13 comments

Comments

@abrown
Contributor

abrown commented Feb 6, 2020

Certain SIMD conversions seem to have inefficient lowerings on x64. f32x4.convert_i32x4_u is lowered to 8 instructions by V8. The signed version, f32x4.convert_i32x4_s, on the other hand, is lowered to a single instruction.
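For context on why the unsigned case is so much worse: SSE2 provides CVTDQ2PS for signed i32-to-f32, but there is no unsigned equivalent until AVX-512's VCVTUDQ2PS, so engines emit a multi-instruction sequence that splits each unsigned lane into halves the signed convert can handle. A scalar C sketch of that split idea (the function name and scalar framing are mine, not V8's actual lowering):

```c
#include <stdint.h>

/* Scalar model of one common unsigned-to-float lowering: split the u32
 * into 16-bit halves, convert each half with the cheap signed
 * conversion, then recombine. Engines do the same per lane with vector
 * instructions, which is where the extra instruction count comes from. */
float u32_to_f32_split(uint32_t x) {
    int32_t hi = (int32_t)(x >> 16);     /* top 16 bits, fits signed range */
    int32_t lo = (int32_t)(x & 0xFFFFu); /* bottom 16 bits */
    /* (float)hi * 2^16 and (float)lo are both exact, so the only
     * rounding happens at the final add. */
    return (float)hi * 65536.0f + (float)lo;
}
```

Because the single rounding happens at the add, the result matches a direct `(float)x` conversion for every input.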

I can't find the v8 implementation for i32x4.trunc_sat_f32x4_s and i32x4.trunc_sat_f32x4_u but I think the situation is the same: the signed version should have a single instruction lowering to CVTTPS2DQ and the unsigned version will require some longer sequence. [edit: this is incorrect, see #173 for a more correct discussion of this inefficiency]
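As #173 discusses, even the signed truncation is not a bare CVTTPS2DQ, because the WebAssembly instruction requires saturation and NaN-to-zero behavior while CVTTPS2DQ returns 0x80000000 for NaN and overflow. A scalar sketch of the per-lane semantics of i32x4.trunc_sat_f32x4_s (function name mine):

```c
#include <stdint.h>

/* Scalar model of i32x4.trunc_sat_f32x4_s per-lane semantics: NaN
 * becomes 0, out-of-range values saturate, in-range values truncate
 * toward zero. CVTTPS2DQ instead produces INT32_MIN (0x80000000) for
 * both NaN and overflow, so the lowering needs fixup code around it. */
int32_t trunc_sat_f32_s(float f) {
    if (f != f) return 0;                      /* NaN lane -> 0 */
    if (f <= -2147483648.0f) return INT32_MIN; /* saturate low */
    if (f >= 2147483648.0f) return INT32_MAX;  /* saturate high */
    return (int32_t)f;                         /* truncate toward zero */
}
```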

The 64x2 versions of these instructions were dropped in #178. For similar reasons (@ngzhian: "because it is uncommon for such instructions to be used, and hardware support is not widespread"), should we remove the unsigned versions?

@zeux
Contributor

zeux commented Feb 7, 2020

the signed version should have a single instruction lowering to CVTTPS2DQ

This is not the case, see #173.

@abrown
Contributor Author

abrown commented Feb 7, 2020

Ah, good point; I didn't read #173 carefully enough. Let me edit out that part of the issue above.

@tlively
Member

tlively commented Feb 14, 2020

Is it uncommon for these instructions to be used? At least as a user, if I'm trying to convert a 64x2 vector, I only have two lanes to manually convert. If we remove the 32x4 conversions, users would have to write 4x more code to convert each lane. That ergonomic loss makes it more important to have 32x4 conversions than 64x2 conversions even if they are similarly unpopular. My guess is that the 32x4 conversions would be more common, although I do not know for sure.

@zeux
Contributor

zeux commented Feb 14, 2020

Note that 64-bit conversions were dropped because they didn’t have an efficient implementation on any architecture, short of AVX512. What does lowering of unsigned conversions look like for ARM/Power?

@abrown
Contributor Author

abrown commented Feb 14, 2020

For ARM in V8 it takes 1 instruction to lower f32x4.convert_i32x4_u.

@jan-wassenberg

Is it uncommon for these instructions to be used?

32 bit [u]int<->float conversions are indeed common. Unfortunately, I doubt many developers are aware that signed is cheaper than unsigned on x86. If they were, signed would be sufficient in a large majority of cases; they would just have to cast to int first.
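The cast-to-int-first point can be made concrete: when a program knows its values fit in [0, 2^31), the unsigned bit pattern reinterpreted as signed is non-negative, so the single-instruction signed convert produces the identical float. A small C illustration (assumption: inputs below 2^31; function name mine):

```c
#include <stdint.h>

/* If x < 2^31, reinterpreting it as signed does not change the value,
 * so the cheap signed convert (CVTDQ2PS in the vector case) gives
 * exactly the same result as a true unsigned convert would. */
float convert_u32_via_signed(uint32_t x) {
    return (float)(int32_t)x; /* valid only for x < 2147483648u */
}
```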

@abrown
Contributor Author

abrown commented Feb 14, 2020

cc: @rrwinterton, @arunetm

@arunetm
Collaborator

arunetm commented Feb 14, 2020

32 bit [u]int<->float conversions are indeed common. Unfortunately, I doubt many developers are aware that signed is cheaper than unsigned on x86. If they were, signed would be sufficient in a large majority of cases; they would just have to cast to int first.

Agree that the inefficiency of unsigned 32-bit conversions seems to be an easy-to-miss detail for developers. The spec's criteria for instruction evaluation call for good support across modern architectures, avoidance of perf cliffs, and evidence of wide app/library usage. An 8-instruction vs. 1-instruction lowering is a large difference in implementation cost IMO and could very well hide unexpected perf cliffs. So we may need to qualify their inclusion with more reasoning.

Any objections on signed 32 bit conversions being sufficient for the majority of cases?

@rrwinterton

True, this is a problem since we don't have the AVX3 ISA, but it is fixed in ICL; some CPUs can't take advantage of this moving forward. Jan, you may find this interesting: https://godbolt.org/z/JD3mjf. See this instruction selected and a nice unroll. I think we can generate some really cool code if we do this right with the right header files, and even extend wasm_simd256.h to include potential AVX3 instructions with ymm registers. I have a wasm_simd256.h header file I am working on. We should start with the workload you are using as a proposed mix and see what AVX3 instructions can provide.

@jan-wassenberg

@rrwinterton Nice, lots of goodies in AVX3, thanks for sharing :)
Good idea, I'd be happy to help with testing AVX2 and 3 benefits using JPEG XL.

@dtig
Member

dtig commented Feb 18, 2020

I agree with @tlively about the ergonomic loss of not having the unsigned conversions, and it would be interesting to see what the exact application usage is.

This case is also particularly interesting because, as @rrwinterton pointed out, the vcvtudq2ps instruction available with the AVX512 feature flag offers a 128-bit version of this for a single-instruction implementation, so it is possible that engines in the future are able to take advantage of this. A broader question to answer here: how do we weigh current performance inefficiencies which may be addressed in the future by engines having better vector support?

My vote here would be for a portable abstraction that is future-proof, i.e. if there are AVX+ instructions that map to one instruction, we should keep these in so engines can optimize when support is available, and this does fit the criteria of support being available in modern architectures; it's just that engines don't support them for codegen at this time. Choosing not to do this would mean we fall into trying to standardize subsets of extensions as they become available, and that model doesn't sound very appealing. As this is mostly an Intel-ism that IIRC we've discussed in the past (possibly before my time on this proposal), I'm curious to see what other opinions there are about this.

@rrwinterton

In order to take advantage of AVX3 we would have to detect the platforms capable of executing the AVX3 VL instructions. There are several instructions for which we should be able to reduce the instruction count on Intel platforms if they are capable of AVX3. I am creating a mapping of those instructions along with the promotion proposal for 256-length vectors. I am wondering if we skip native AVX2 and AVX ISA and just use AVX3 VL for the next generation by the time we move to the next-gen WASM SIMD? Maybe in about 3 weeks I can present this idea? I will let @dtig decide the time frame; earliest is 3+ weeks for me.

@zeux
Contributor

zeux commented Mar 3, 2020

I feel like AVX512 is not going to be practical for years to come as the dominant target ISA. The first laptop CPU that supports it shipped late last year. There’s no support from AMD side right now either. It’s interesting as a codegen option but not as a baseline.
