Optimizing compilers reload vector constants needlessly
Modern processors have powerful vector instructions that let you load several values at once and operate on all of them in a single instruction. They also let you use vector constants. Thus, if you wanted to add some integer (say 10001) to every integer in a large array, you might first load a vector constant holding 8 copies of the value 10001, then load elements from your array 8 at a time, add the vector constant (thus doing 8 additions at once), and store the result. Everything else being equal, this might be 8 times faster. An optimizing compiler might even do this optimization for you (a process called ‘auto-vectorization’). However, for more complex code, you might need to do it manually using “intrinsic” functions (e.g., _mm256_loadu_si256, _mm256_add_epi32, etc.). Let us consider the simple case I describe, but where we process two arrays at once… using the same constant: #include #include void process_avx2(const uint32_t *in1, const uint32_t *in2, size_t len) { // define the constant, 8 x 10001 __m256i c = _mm256_set1_epi32(10001); const uint32_t *finalin1…
https://lemire.me/blog/2022/12/06/optimizing-compilers-reload-vector-constants-needlessly/
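To make the single-array case described above concrete, here is a minimal sketch in C. It is not the post's process_avx2 function, which handles two arrays and is truncated in this excerpt. The names add_constant, add_constant_avx2 and add_constant_scalar are hypothetical, and the sketch assumes GCC or Clang (it relies on the target attribute and on __builtin_cpu_supports so that it builds without -mavx2 and falls back to scalar code when AVX2 is unavailable).

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAVE_X86 1
#endif

// Scalar reference: add 10001 to each element, one at a time.
static void add_constant_scalar(uint32_t *data, size_t len) {
  for (size_t i = 0; i < len; i++) {
    data[i] += 10001;
  }
}

#ifdef HAVE_X86
// AVX2 version (hypothetical helper). The target attribute is a GCC/Clang
// extension that enables AVX2 for this one function only.
__attribute__((target("avx2")))
static void add_constant_avx2(uint32_t *data, size_t len) {
  // The vector constant: 8 copies of 10001 in one 256-bit register.
  const __m256i c = _mm256_set1_epi32(10001);
  size_t i = 0;
  // Main loop: load 8 integers, do 8 additions at once, store 8 results.
  for (; i + 8 <= len; i += 8) {
    __m256i v = _mm256_loadu_si256((const __m256i *)(data + i));
    v = _mm256_add_epi32(v, c);
    _mm256_storeu_si256((__m256i *)(data + i), v);
  }
  // Tail: fewer than 8 elements remain, so finish them one by one.
  for (; i < len; i++) {
    data[i] += 10001;
  }
}
#endif

// Entry point: use AVX2 when the running CPU supports it, else scalar code.
void add_constant(uint32_t *data, size_t len) {
#ifdef HAVE_X86
  if (__builtin_cpu_supports("avx2")) {
    add_constant_avx2(data, len);
    return;
  }
#endif
  add_constant_scalar(data, len);
}
```

Calling add_constant(array, n) updates the array in place. The constant is created once, before the loop, with the intent that it stays in a register for every iteration.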