Include vectorized bit count instructions #6

stoklund · 2017-04-20T23:53:51Z

@AndrewScheidecker mentioned in his review of #1 the possibility of including vectorized bit counting instructions to match the existing scalar instructions. They would have these signatures:

i8x16.clz(x: v128) -> v128
i16x8.clz(x: v128) -> v128
i32x4.clz(x: v128) -> v128
i64x2.clz(x: v128) -> v128
i8x16.ctz(x: v128) -> v128
i16x8.ctz(x: v128) -> v128
i32x4.ctz(x: v128) -> v128
i64x2.ctz(x: v128) -> v128

At least AArch64 has vectorized CLZ and RBIT instructions that could be used to implement this. But they could be quite impractical to emulate on other platforms.
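As a rough illustration of the CLZ plus RBIT route, here is a scalar sketch of a single 8-bit lane; a vector instruction would apply the same operation to all 16 lanes independently. The helper names (`clz8`, `rbit8`, `ctz8`, `i8x16_ctz_ref`) are hypothetical and only model the intended semantics, assuming a zero input yields the lane width.

```c
#include <stdint.h>

/* Count leading zeros of one 8-bit lane; returns 8 when x == 0. */
static unsigned clz8(uint8_t x) {
    unsigned n = 0;
    for (uint8_t mask = 0x80; mask != 0 && (x & mask) == 0; mask >>= 1)
        n++;
    return n;
}

/* Reverse the bit order of one 8-bit lane (a per-byte bit reversal). */
static uint8_t rbit8(uint8_t x) {
    x = (uint8_t)(((x & 0xF0) >> 4) | ((x & 0x0F) << 4)); /* swap nibbles */
    x = (uint8_t)(((x & 0xCC) >> 2) | ((x & 0x33) << 2)); /* swap bit pairs */
    x = (uint8_t)(((x & 0xAA) >> 1) | ((x & 0x55) << 1)); /* swap single bits */
    return x;
}

/* Trailing zeros are the leading zeros of the bit-reversed value. */
static unsigned ctz8(uint8_t x) {
    return clz8(rbit8(x));
}

/* Hypothetical reference model for i8x16.ctz over a 16-byte vector. */
static void i8x16_ctz_ref(const uint8_t in[16], uint8_t out[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)ctz8(in[i]);
}
```

Wider lanes would need a bit reversal across the whole lane, not just within each byte, so this sketch covers the 8-bit case only.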

Are these instructions widely available in SIMD instruction sets?
Are there plausible applications for these instructions?


simicicd · 2017-04-21T15:20:33Z

The MIPS SIMD implementation (MSA instructions) defines the following bit count instructions:

NLZC.df - Vector element count of leading bits set to 0.
NLOC.df - Vector element count of leading bits set to 1.
PCNT.df - Vector element count of all bits set to 1.
df is the vector element size and can be 8, 16, 32, or 64 bits.

There are no CTZ or RBIT instructions, but CTZ could probably be emulated with 4-5 different instructions (including PCNT.df).
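The comment does not spell out the instruction sequence. One well-known identity that fits the description is ctz(x) = popcnt(~x & (x - 1)): the mask keeps exactly the bits below the lowest set bit. The scalar sketch below models one 8-bit lane; `popcnt8` stands in for a per-lane population count such as PCNT.df, and both function names are hypothetical.

```c
#include <stdint.h>

/* Population count of one 8-bit lane (stand-in for a vector popcount). */
static unsigned popcnt8(uint8_t x) {
    unsigned n = 0;
    while (x) {
        x &= (uint8_t)(x - 1); /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* ctz via popcount: count the bits strictly below the lowest set bit.
   For x == 0 the mask is all ones, so the result is the lane width (8). */
static unsigned ctz8_via_popcnt(uint8_t x) {
    uint8_t below = (uint8_t)(~x & (x - 1));
    return popcnt8(below);
}
```

In vector form this is a subtract, an and-not, and a popcount per lane, which is consistent with the handful of instructions estimated above.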

billbudge · 2017-04-21T18:45:24Z

ARM v7/v8 have:
Vector Count Leading Sign Bits VCLS
Vector Count Leading Zeros VCLZ
Vector Count Set Bits VCNT

stoklund · 2017-05-25T20:12:42Z

Intel has i32x4.clz and i64x2.clz, but only in AVX-512. I don't see vectorized versions of the other instructions.

gnzlbg · 2019-03-02T15:22:14Z

> Are there plausible applications for these instructions?

Those who added these (e.g. i32x16::count_ones) to Rust's portable packed SIMD module use them in the implementation of vectorized PRNGs and cared mostly about AVX-512 support.

We have a vectorized ambient occlusion example and benchmark that makes heavy use of a fairly poor vectorized PRNG, and going from a scalar to a vectorized PRNG had a huge performance impact. I don't recall exactly how big it was for this benchmark (I think in the ballpark of 1.5-2x; it was one of the last optimizations we did to catch up with ISPC on performance), but it shouldn't be too hard to switch the PRNG back to a scalar one and get some numbers.

aqrit · 2021-01-16T02:33:03Z

SSE2+: CLZ/CTZ for 32/64 bit could use a floating point hack.
SSSE3: CLZ/CTZ for 8/16 bit could use PSHUFB.
CLZ/CTZ could also be emulated on top of popcnt.
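To make the first two ideas concrete, here is a scalar sketch of each, modelling a single lane. It is not the SSE2/SSSE3 code itself: `clz32_via_double` reads the exponent of an IEEE-754 double (assumed to be the platform's `double` format), and `clz8_via_nibble_lut` uses a 16-entry table of the kind PSHUFB can look up in parallel. All names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* clz of a 32-bit lane from the exponent of its floating-point value. */
static unsigned clz32_via_double(uint32_t x) {
    if (x == 0)
        return 32;                    /* zero has no exponent to read */
    double d = (double)x;             /* exact: 32 bits fit in a 53-bit mantissa */
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);   /* reinterpret without aliasing issues */
    unsigned exponent = (unsigned)(bits >> 52) - 1023u; /* floor(log2(x)) */
    return 31u - exponent;
}

/* Leading-zero count of each 4-bit value, counted within 4 bits. */
static const uint8_t clz_nibble[16] = {
    4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
};

/* clz of an 8-bit lane from two nibble lookups. */
static unsigned clz8_via_nibble_lut(uint8_t x) {
    uint8_t hi = (uint8_t)(x >> 4);
    uint8_t lo = (uint8_t)(x & 0x0F);
    return hi ? clz_nibble[hi] : 4u + clz_nibble[lo];
}
```

A single-precision conversion has only a 24-bit mantissa, so large inputs can round up to the next power of two; a vector version built on float conversion would need to account for that.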

SSE2 8-bit examples (these need the SSE2 intrinsics header, <emmintrin.h>):

__m128i sse2_tzcnt_epi8(__m128i v) {
    const __m128i x00 = _mm_setzero_si128();
    const __m128i x55 = _mm_set1_epi8(0x55);
    const __m128i x33 = _mm_set1_epi8(0x33);
    const __m128i x0F = _mm_set1_epi8(0x0F);

    __m128i r = x00;
    v = _mm_and_si128(v, _mm_sub_epi8(x00, v)); // isolate ls1b
    r = _mm_avg_epu8(r, _mm_cmpeq_epi8(_mm_and_si128(x55, v), x00));
    r = _mm_avg_epu8(r, _mm_cmpeq_epi8(_mm_and_si128(x33, v), x00));
    r = _mm_avg_epu8(r, _mm_cmpeq_epi8(_mm_and_si128(x0F, v), x00));
    r = _mm_sub_epi8(_mm_srli_epi16(r, 5), _mm_cmpeq_epi8(v, x00));
    return r;
}

__m128i sse2_lzcnt_epi8 (__m128i v) {
    __m128i m0 = _mm_setzero_si128();
    __m128i m1 = _mm_set1_epi8(0x0F);
    __m128i m2 = _mm_set1_epi8(0x33);
    __m128i m3 = _mm_set1_epi8(0x55);

    m1 = _mm_and_si128(m1, v);
    v = _mm_max_epu8(_mm_xor_si128(v, m1), m1);
    m1 = _mm_cmpeq_epi8(m1, v);

    m2 = _mm_and_si128(m2, v);
    v = _mm_max_epu8(_mm_xor_si128(v, m2), m2);
    m2 = _mm_cmpeq_epi8(m2, v);

    m3 = _mm_and_si128(m3, v);
    v = _mm_max_epu8(_mm_xor_si128(v, m3), m3);
    m3 = _mm_cmpeq_epi8(m3, v);

    v = _mm_cmpeq_epi8(v, m0);
    m0 = _mm_avg_epu8(m0, m3);
    m0 = _mm_avg_epu8(m0, m2);
    m0 = _mm_avg_epu8(m0, m1);
    m0 = _mm_sub_epi8(_mm_srli_epi16(m0, 5), v);
    return m0;
}
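Both functions rely on the same trick: each rounded average shifts the accumulator right by one and inserts a new flag at bit 7, so after three steps the three flags sit in bits 5..7 and a shift by 5 yields a value in 0..7. The scalar model below walks through the tzcnt variant for one byte. `avg_u8` and `cmpeq_u8` are hypothetical stand-ins for the per-byte behaviour of `_mm_avg_epu8` and `_mm_cmpeq_epi8`.

```c
#include <stdint.h>

/* Rounded-up unsigned average, as _mm_avg_epu8 computes per byte. */
static uint8_t avg_u8(uint8_t a, uint8_t b) {
    return (uint8_t)((a + b + 1) >> 1);
}

/* 0xFF when equal, 0x00 otherwise, as _mm_cmpeq_epi8 computes per byte. */
static uint8_t cmpeq_u8(uint8_t a, uint8_t b) {
    return a == b ? 0xFF : 0x00;
}

/* Scalar model of one lane of sse2_tzcnt_epi8. */
static uint8_t tzcnt8_model(uint8_t v) {
    uint8_t r = 0;
    v = (uint8_t)(v & (0 - v));                        /* isolate lowest set bit */
    r = avg_u8(r, cmpeq_u8((uint8_t)(v & 0x55), 0));   /* flag: index bit 0 */
    r = avg_u8(r, cmpeq_u8((uint8_t)(v & 0x33), 0));   /* flag: index bit 1 */
    r = avg_u8(r, cmpeq_u8((uint8_t)(v & 0x0F), 0));   /* flag: index bit 2 */
    /* Flags are now in bits 5..7; add 1 for v == 0 so the result is 8. */
    return (uint8_t)((r >> 5) + (v == 0));
}
```

Because the low five bits of every accumulator byte stay zero, the 16-bit shift in the SSE2 versions cannot leak set bits from the neighbouring byte, which is why `_mm_srli_epi16` is safe there even though SSE2 has no 8-bit shift.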

Count Leading (Redundant) Sign Bits could be emulated as:

int my__builtin_clrsb (int val) {
    unsigned u = (unsigned)val;
    return __builtin_clz((u + u) ^ (u | 1));
}
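The same identity carries over to narrower lanes. XORing the value with itself shifted left by one marks every position where adjacent bits differ, and OR-ing in bit 0 guarantees a nonzero input for the leading-zero count. A minimal sketch for one 8-bit lane, with hypothetical helper names:

```c
#include <stdint.h>

/* Count leading zeros of one 8-bit lane; returns 8 when x == 0. */
static unsigned clz8(uint8_t x) {
    unsigned n = 0;
    for (uint8_t mask = 0x80; mask != 0 && (x & mask) == 0; mask >>= 1)
        n++;
    return n;
}

/* Count leading redundant sign bits of one 8-bit lane (result 0..7). */
static unsigned cls8(uint8_t x) {
    /* Bit i (i >= 1) is set where bit i and bit i-1 of x differ;
       bit 0 is always set, so the argument to clz8 is never zero. */
    uint8_t t = (uint8_t)((x << 1) ^ (x | 1));
    return clz8(t);
}
```

For example, 0xC0 (binary 11000000) has one redundant copy of its sign bit, and 0x00 and 0xFF both have seven.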

dtig added the post SIMD MVP label Jan 23, 2019

bjorn3 mentioned this issue May 18, 2021

Create a Vector Math library to allow SimdF32::sin and similar to work in core rust-lang/portable-simd#109

Open
