This repository has been archived by the owner on Dec 22, 2021. It is now read-only.
Suppose you have two vectors u and v, and you want to multiply all elements of the vector u by a single lane of the vector v, e.g. v[0]. This is a very common thing to do, particularly in float matrix multiplication kernels.
This should be available for all multiplication instructions, both float and integer, including any multiply-add instructions if those are added to the spec. It will map directly to the corresponding instructions on ARM and will be implemented on x86 by using a broadcast instruction into a temporary vector.
Rationale for this programming model in WebAssembly SIMD:
It's more expressive w.r.t. what many applications need to do.
The fallback is efficient provided the generated code orders instructions well. By contrast, without this instruction the WebAssembly source is forced to use separate broadcast instructions, which makes it essentially impossible for the generated code to be efficient.
See ARM benchmarks in this spreadsheet.
Row 30, NEON_64bit_GEMM_Float32_WithVectorDuplicatingScalar, is the float kernel that one can write without such instructions.
Row 31, NEON_64bit_GEMM_Float32_WithScalar, is the faster float kernel that one can write with such instructions.