Fix an unfortunate bug with Visual Studio 2015
Evidently this instruction, despite the intrinsic taking a register operand,
is a memory-register instruction, and there seems to be no alignment requirement
on the source operand. Because of this, when compiling without optimizations,
compilers do the unaligned load and then spill the value back to the stack before
performing the broadcasting load. In doing so, MSVC spills to the stack with an
aligned move at an unaligned address, causing a segfault. GCC does not make this
mistake, as it spills to an aligned address.

If we're on Visual Studio 2015, let's just do the longer 9-cycle sequence of a
128-bit load followed by a vinserti128. This _should_ fix the issue (zlib-ng#1861).
KungFuJesus authored and Dead2 committed Feb 4, 2025
1 parent a3c0430 commit 287c4dc
Showing 2 changed files with 16 additions and 0 deletions.
6 changes: 6 additions & 0 deletions arch/x86/chunkset_avx2.c
@@ -32,7 +32,13 @@ static inline void chunkmemset_8(uint8_t *from, chunk_t *chunk) {
}

static inline void chunkmemset_16(uint8_t *from, chunk_t *chunk) {
    /* See explanation in chunkset_avx512.c */
#if defined(_MSC_VER) && _MSC_VER <= 1900
    halfchunk_t half = _mm_loadu_si128((__m128i*)from);
    *chunk = _mm256_inserti128_si256(_mm256_castsi128_si256(half), half, 1);
#else
    *chunk = _mm256_broadcastsi128_si256(_mm_loadu_si128((__m128i*)from));
#endif
}

static inline void loadchunk(uint8_t const *s, chunk_t *chunk) {
10 changes: 10 additions & 0 deletions arch/x86/chunkset_avx512.c
@@ -46,7 +46,17 @@ static inline void chunkmemset_8(uint8_t *from, chunk_t *chunk) {
}

static inline void chunkmemset_16(uint8_t *from, chunk_t *chunk) {
    /* Unfortunately there seems to be a compiler bug in Visual Studio 2015 where
     * the load is dumped to the stack with an aligned move for this memory-register
     * broadcast. The vbroadcasti128 instruction is 2 fewer cycles, and this dump to
     * the stack doesn't exist if compiled with optimizations. For the sake of working
     * properly in a debugger, let's take the 2-cycle penalty. */
#if defined(_MSC_VER) && _MSC_VER <= 1900
    halfchunk_t half = _mm_loadu_si128((__m128i*)from);
    *chunk = _mm256_inserti128_si256(_mm256_castsi128_si256(half), half, 1);
#else
    *chunk = _mm256_broadcastsi128_si256(_mm_loadu_si128((__m128i*)from));
#endif
}

static inline void loadchunk(uint8_t const *s, chunk_t *chunk) {
