This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Inefficient x64 codegen for conversion instructions #190

Open
abrown opened this issue Feb 6, 2020 · 13 comments

Comments

@abrown
Contributor

abrown commented Feb 6, 2020

Certain SIMD conversions seem to have inefficient lowerings on x64. f32x4.convert_i32x4_u is lowered to 8 instructions by V8. The signed version, f32x4.convert_i32x4_s, on the other hand, is lowered to a single instruction.
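For context on why the unsigned case is so much worse: SSE2 provides CVTDQ2PS for signed i32-to-f32, but there is no unsigned equivalent until AVX-512's VCVTUDQ2PS, so engines emit a multi-instruction sequence that splits each unsigned lane into halves the signed convert can handle. A scalar C sketch of that split idea (the function name and scalar framing are mine, not V8's actual lowering):

```c
#include <stdint.h>

/* Scalar model of one common unsigned-to-float lowering: split the u32
 * into 16-bit halves, convert each half with the cheap signed
 * conversion, then recombine. Engines do the same per lane with vector
 * instructions, which is where the extra instruction count comes from. */
float u32_to_f32_split(uint32_t x) {
    int32_t hi = (int32_t)(x >> 16);     /* top 16 bits, fits signed range */
    int32_t lo = (int32_t)(x & 0xFFFFu); /* bottom 16 bits */
    /* (float)hi * 2^16 and (float)lo are both exact, so the only
     * rounding happens at the final add. */
    return (float)hi * 65536.0f + (float)lo;
}
```

Because the single rounding happens at the add, the result matches a direct `(float)x` conversion for every input.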

I can't find the v8 implementation for i32x4.trunc_sat_f32x4_s and i32x4.trunc_sat_f32x4_u but I think the situation is the same: the signed version should have a single instruction lowering to CVTTPS2DQ and the unsigned version will require some longer sequence. [edit: this is incorrect, see #173 for a more correct discussion of this inefficiency]
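As #173 discusses, even the signed truncation is not a bare CVTTPS2DQ, because the WebAssembly instruction requires saturation and NaN-to-zero behavior while CVTTPS2DQ returns 0x80000000 for NaN and overflow. A scalar sketch of the per-lane semantics of i32x4.trunc_sat_f32x4_s (function name mine):

```c
#include <stdint.h>

/* Scalar model of i32x4.trunc_sat_f32x4_s per-lane semantics: NaN
 * becomes 0, out-of-range values saturate, in-range values truncate
 * toward zero. CVTTPS2DQ instead produces INT32_MIN (0x80000000) for
 * both NaN and overflow, so the lowering needs fixup code around it. */
int32_t trunc_sat_f32_s(float f) {
    if (f != f) return 0;                      /* NaN lane -> 0 */
    if (f <= -2147483648.0f) return INT32_MIN; /* saturate low */
    if (f >= 2147483648.0f) return INT32_MAX;  /* saturate high */
    return (int32_t)f;                         /* truncate toward zero */
}
```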

The 64x2 versions of these instructions were dropped in #178. For similar reasons (@ngzhian: "because it is uncommon for such instructions to be used, and hardware support is not widespread"), should we remove the unsigned versions?

@zeux
Contributor

zeux commented Feb 7, 2020

the signed version should have a single instruction lowering to CVTTPS2DQ

This is not the case, see #173.

@abrown
Contributor Author

abrown commented Feb 7, 2020

Ah, good point; I didn't read #173 carefully enough. Let me edit out that part of the issue above.

@tlively
Member

tlively commented Feb 14, 2020

Is it uncommon for these instructions to be used? At least as a user, if I'm trying to convert a 64x2 vector, I only have two lanes to manually convert. If we remove the 32x4 conversions, users would have to write 4x more code to convert each lane. That ergonomic loss makes it more important to have 32x4 conversions than 64x2 conversions even if they are similarly unpopular. My guess is that the 32x4 conversions would be more common, although I do not know for sure.

@zeux
Contributor

zeux commented Feb 14, 2020

Note that 64-bit conversions were dropped because they didn’t have an efficient implementation on any architecture, short of AVX512. What does lowering of unsigned conversions look like for ARM/Power?

@abrown
Contributor Author

abrown commented Feb 14, 2020

For ARM in V8 it takes 1 instruction to lower f32x4.convert_i32x4_u.

@jan-wassenberg

Is it uncommon for these instructions to be used?

32 bit [u]int<->float conversions are indeed common. Unfortunately, I doubt many developers are aware that signed is cheaper than unsigned on x86. If they were, signed would be sufficient in a large majority of cases; they would just have to cast to int first.
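The cast-to-int-first point can be made concrete: when a program knows its values fit in [0, 2^31), the unsigned bit pattern reinterpreted as signed is non-negative, so the single-instruction signed convert produces the identical float. A small C illustration (assumption: inputs below 2^31; function name mine):

```c
#include <stdint.h>

/* If x < 2^31, reinterpreting it as signed does not change the value,
 * so the cheap signed convert (CVTDQ2PS in the vector case) gives
 * exactly the same result as a true unsigned convert would. */
float convert_u32_via_signed(uint32_t x) {
    return (float)(int32_t)x; /* valid only for x < 2147483648u */
}
```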

@abrown
Contributor Author

abrown commented Feb 14, 2020

cc: @rrwinterton, @arunetm

@arunetm
Collaborator

arunetm commented Feb 14, 2020

32 bit [u]int<->float conversions are indeed common. Unfortunately, I doubt many developers are aware that signed is cheaper than unsigned on x86. If they were, signed would be sufficient in a large majority of cases; they would just have to cast to int first.

Agree that the inefficiency of unsigned 32-bit conversions seems to be an easy-to-miss detail for developers. The spec's criteria for instruction evaluation call for good support across modern architectures, avoidance of perf cliffs, and evidence of wide app/library usage. An 8-instruction vs. 1-instruction lowering is a large difference in implementation cost IMO and could very well hide unexpected perf cliffs. So we may need to qualify their inclusion with more reasoning.

Any objections on signed 32 bit conversions being sufficient for the majority of cases?

@rrwinterton

True, this is a problem since we don't have the AVX3 ISA, but it is fixed in ICL; some CPUs can't take advantage of this moving forward. Jan, you may find this interesting: https://godbolt.org/z/JD3mjf. See this instruction selected and a nice unroll. I think we can generate some really cool code if we do this right with the right header files, and even extend wasm_simd256.h to include potential AVX3 instructions with ymm registers. I have a wasm_simd256.h header file I am working on. We should start with the workload you are using as a proposed mix and see what AVX3 instructions can provide.

@jan-wassenberg

@rrwinterton Nice, lots of goodies in AVX3, thanks for sharing :)
Good idea, I'd be happy to help with testing AVX2 and 3 benefits using JPEG XL.

@dtig
Member

dtig commented Feb 18, 2020

I agree with @tlively about the ergonomic loss of not having the unsigned conversions, and it would be interesting to see what the exact application usage is.

This case is also particularly interesting because, as @rrwinterton pointed out, the vcvtudq2ps instruction available with the AVX512 feature flag offers a 128-bit version of this for a single-instruction implementation, so it is possible that engines in the future are able to take advantage of this. A broader question to answer here: how do we weigh current performance inefficiencies which may be addressed in the future by engines having better vector support?

My vote here would be for a portable abstraction that is future-proof, i.e. if there are AVX+ instructions that map to one instruction, we should keep these in so engines can optimize when support is available, and this does fit the criteria of support being available in modern architectures; it's just that engines don't support them for codegen at this time. Choosing not to do this would mean we fall into trying to standardize subsets of extensions as they become available, and that model doesn't sound very appealing. As this is mostly an Intel-ism that IIRC we've discussed in the past (possibly before my time on this proposal), I'm curious to see what other opinions there are about this.

@rrwinterton

In order to take advantage of AVX3 we would have to detect the platforms capable of executing the AVX3 VL instructions. There are several instructions for which we should be able to reduce the instruction count on Intel platforms if they are capable of AVX3. I am creating a mapping of those instructions along with the promotion proposal for 256-length vectors. I am wondering if we skip native AVX2 and AVX ISA and just use AVX3 VL for the next generation by the time we move to the next-gen WASM SIMD? Maybe in about 3 weeks I can present this idea? I will let @dtig decide the time frame; earliest is 3+ weeks for me.

@zeux
Contributor

zeux commented Mar 3, 2020

I feel like AVX512 is not going to be practical for years to come as the dominant target ISA. The first laptop CPU that supports it shipped late last year. There’s no support from AMD side right now either. It’s interesting as a codegen option but not as a baseline.
