Fix an unfortunate bug with Visual Studio 2015
Evidently this instruction, despite the intrinsic taking a register operand,
is a memory-register instruction, and there seems to be no alignment requirement
on the source operand. Because of this, when compiling without optimizations,
compilers do the unaligned load and then spill the value back to the stack before
performing the broadcasting load. In doing so, MSVC spills to the stack with an
aligned move at an unaligned address, causing a segfault. GCC does not make this
mistake, as it spills to an aligned address.

If we're on Visual Studio 2015, let's just do the longer 9-cycle sequence of a
128-bit load followed by a vinserti128. This _should_ fix the issue (zlib-ng#1861).
KungFuJesus authored and Dead2 committed Feb 4, 2025
1 parent a3c0430 commit 287c4dc
Showing 2 changed files with 16 additions and 0 deletions.
6 changes: 6 additions & 0 deletions arch/x86/chunkset_avx2.c
@@ -32,7 +32,13 @@ static inline void chunkmemset_8(uint8_t *from, chunk_t *chunk) {
}

static inline void chunkmemset_16(uint8_t *from, chunk_t *chunk) {
    /* See explanation in chunkset_avx512.c */
#if defined(_MSC_VER) && _MSC_VER <= 1900
    halfchunk_t half = _mm_loadu_si128((__m128i*)from);
    *chunk = _mm256_inserti128_si256(_mm256_castsi128_si256(half), half, 1);
#else
    *chunk = _mm256_broadcastsi128_si256(_mm_loadu_si128((__m128i*)from));
#endif
}

static inline void loadchunk(uint8_t const *s, chunk_t *chunk) {
10 changes: 10 additions & 0 deletions arch/x86/chunkset_avx512.c
@@ -46,7 +46,17 @@ static inline void chunkmemset_8(uint8_t *from, chunk_t *chunk) {
}

static inline void chunkmemset_16(uint8_t *from, chunk_t *chunk) {
    /* Unfortunately there seems to be a compiler bug in Visual Studio 2015 where
     * the load is dumped to the stack with an aligned move for this memory-register
     * broadcast. The vbroadcasti128 instruction is 2 fewer cycles, and this dump to
     * the stack doesn't exist if compiled with optimizations. For the sake of working
     * properly in a debugger, let's take the 2-cycle penalty. */
#if defined(_MSC_VER) && _MSC_VER <= 1900
    halfchunk_t half = _mm_loadu_si128((__m128i*)from);
    *chunk = _mm256_inserti128_si256(_mm256_castsi128_si256(half), half, 1);
#else
    *chunk = _mm256_broadcastsi128_si256(_mm_loadu_si128((__m128i*)from));
#endif
}

static inline void loadchunk(uint8_t const *s, chunk_t *chunk) {
