Using FMA everywhere hurts performance
Cool one: Fused multiply accumulate (FMA)
    //... stuff...
    x[0] = y[0]; // 128b copy
    x[1] = y[1]; // 128b copy
    //... stuff...

“optimized”

    //... stuff...
    x = y; // 256b copy
    //... stuff...

This may cause huge slowdowns on some chips
What?
Intel Pentium 3 (1999), AMD Athlon XP (2001): some 128-bit SIMD instructions, /arch:SSE, Visual C++ ?
Intel Pentium 4 (2001), AMD Athlon 64 (2003): 128-bit SIMD instructions, /arch:SSE2, Visual Studio .NET 2003
Intel Sandy Bridge (2011), AMD Bulldozer (2011): FP 256-bit SIMD instructions, /arch:AVX, Visual Studio 2010
Intel Haswell (2013), future AMD chip (?): 256-bit SIMD instructions, /arch:AVX2, Visual Studio 2013 Update 2 (optimization support)
New hotness!
_mm_fmadd_ss, _mm_fmsub_ss, _mm_fnmadd_ss, _mm_fnmsub_ss,
_mm_fmadd_sd, _mm_fmsub_sd, _mm_fnmadd_sd, _mm_fnmsub_sd,
_mm_fmadd_ps, _mm_fmsub_ps, _mm_fnmadd_ps, _mm_fnmsub_ps,
_mm_fmadd_pd, _mm_fmsub_pd, _mm_fnmadd_pd, _mm_fnmsub_pd,
_mm256_fmadd_ps, _mm256_fmsub_ps, _mm256_fnmadd_ps, _mm256_fnmsub_ps,
_mm256_fmadd_pd, _mm256_fmsub_pd, _mm256_fnmadd_pd, _mm256_fnmsub_pd
/arch:AVX2
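The four name stems in the list above (fmadd, fmsub, fnmadd, fnmsub) differ only in the signs applied to the product and the addend. As a rough, portable sketch of those sign conventions, the scalar double-precision forms can be written with the standard library's `std::fma`. The wrapper names below are hypothetical and are not the intrinsics themselves.

```cpp
#include <cmath>

// Hypothetical wrappers that mirror the sign conventions of the FMA
// intrinsic families, using std::fma (one rounding step) as a stand-in.
inline double fmadd(double a, double b, double c)  { return std::fma(a, b, c); }    //  (a*b) + c
inline double fmsub(double a, double b, double c)  { return std::fma(a, b, -c); }   //  (a*b) - c
inline double fnmadd(double a, double b, double c) { return std::fma(-a, b, c); }   // -(a*b) + c
inline double fnmsub(double a, double b, double c) { return std::fma(-a, b, -c); }  // -(a*b) - c
```

Whether `std::fma` turns into a hardware FMA instruction depends on the compiler and target flags; without hardware support it may fall back to a slower software routine.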
res = A*B + C
Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles
Multiply, then add: 5 cycles + 3 cycles
FMA: 5 cycles
res = A*B + C*D
Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles
Multiply A*B and C*D, then add: 5 cycles + 3 cycles
Multiply C*D, then FMA with A and B: 5 cycles + 5 cycles
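The arithmetic behind these two cases can be sketched as a tiny critical-path model. It uses only the latencies quoted on the slides and assumes (this is an assumption, not something the slides state) that two independent multiplies can execute in parallel. The function names are hypothetical.

```cpp
#include <algorithm>

// Latencies quoted on the slides, in cycles.
constexpr int kMul = 5;
constexpr int kAdd = 3;
constexpr int kFma = 5;

// res = A*B + C
constexpr int mul_add_single() { return kMul + kAdd; }  // multiply, then add
constexpr int fma_single()     { return kFma; }         // one fused operation

// res = A*B + C*D
// Without FMA: the two multiplies are independent, so (by assumption) they
// overlap; the add waits for both.
constexpr int mul_add_two_products() { return std::max(kMul, kMul) + kAdd; }
// With FMA: C*D has to finish before fma(A, B, C*D) can start.
constexpr int fma_two_products()     { return kMul + kFma; }
```

Under this model FMA wins for a single product (5 cycles against 8) and loses for the sum of two products (10 cycles against 8), which is one way to read the claim that using FMA everywhere hurts performance.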
Dot product: dp accumulates ... A[5]*B[5], A[6]*B[6] ... (temporaries t1, t2)
Multiply, then add, per element: 5 cycles + 3 cycles, 5 cycles + 3 cycles
Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles
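One plausible reading of the dot-product slide is that the running sum `dp` is a loop-carried dependency. With a separate multiply and add, the multiply does not depend on `dp`, so only the 3-cycle add sits on the chain. With FMA, the whole 5-cycle operation does. The sketch below shows the two loop shapes; the function names are hypothetical, and a compiler is free to contract the first form into an FMA on its own.

```cpp
#include <cmath>
#include <cstddef>

// Separate multiply and add: t does not depend on dp, so the multiply for
// the next element can start before the current add has finished.
float dot_mul_add(const float* a, const float* b, std::size_t n) {
    float dp = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        float t = a[i] * b[i];  // independent of dp
        dp = dp + t;            // only the add is on the dp chain
    }
    return dp;
}

// Fused form: each iteration's FMA needs the previous dp as an input,
// so the full FMA latency is on the dp chain.
float dot_fma(const float* a, const float* b, std::size_t n) {
    float dp = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        dp = std::fma(a[i], b[i], dp);
    }
    return dp;
}
```

Both functions compute the same mathematical result; they can differ in the last bits because the fused form rounds once per element instead of twice.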
Highly optimized CPU code isn’t CPU code.
    for (i=0; i<1000; i++)
        A[i] = B[i] + C[i];

autovec (128-bit):

    for (i=0; i<1000; i+=4)
        xmm1 = vmovups B[i]
        xmm2 = vaddps xmm1, C[i]
        A[i] = vmovups xmm2

autovec (256-bit):

    for (i=0; i<1000; i+=8)
        ymm1 = vmovups B[i]
        ymm2 = vaddps ymm1, C[i]
        A[i] = vmovups ymm2
                       Total     CPU      Mem
32-bit float scalar    100 ms    80 ms    20 ms
128-bit SIMD           40 ms     20 ms    20 ms    2.5x speedup
256-bit SIMD           30 ms     10 ms    20 ms    1.3x speedup

Memory Bound
Highly optimized CPU code isn’t CPU code.
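The numbers in this table follow from a simple model: only the CPU portion shrinks with wider SIMD, while the memory portion stays fixed. A minimal sketch, assuming the CPU time scales perfectly with vector width (4x for 128-bit, 8x for 256-bit floats); the function name is hypothetical.

```cpp
// Total time when only the CPU part benefits from SIMD.
// scalar_cpu_ms: CPU time of the scalar loop
// mem_ms:        time spent waiting on memory (unchanged by SIMD)
// simd_factor:   assumed ideal speedup of the CPU part (1, 4, or 8 here)
constexpr double total_ms(double scalar_cpu_ms, double mem_ms, double simd_factor) {
    return scalar_cpu_ms / simd_factor + mem_ms;
}

constexpr double kScalar  = total_ms(80.0, 20.0, 1.0);  // 100 ms
constexpr double kSimd128 = total_ms(80.0, 20.0, 4.0);  // 40 ms
constexpr double kSimd256 = total_ms(80.0, 20.0, 8.0);  // 30 ms
```

Going from scalar to 128-bit gives 100 / 40 = 2.5x, but doubling the width again gives only 40 / 30, about 1.3x, because the fixed 20 ms of memory time now dominates.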
Windows task manager won’t help you here
8.5 ms: enh
6.4 ms: yay
10 ms: this sucks
    struct MyData {
        Vector4D v1; // 4 floats
        Vector4D v2; // 4 floats
    };

    MyData x;
    MyData y;

    void func2() {
        //... unrelated stuff...
        func3();
        //... unrelated stuff...
        x.v1 = y.v1; // 128-bit copy
        x.v2 = y.v2; // 128-bit copy
    }

The two 128-bit copies become a single copy:

        x = y; // 256-bit copy

This caused the 60% slowdown on Haswell
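The two copy forms are interchangeable at the source level, which is why a compiler may pick either. A minimal sketch of that equivalence follows. `Vector4D` is given a hypothetical layout of four floats, matching the comments in the code; the real type is not shown in the slides, and the helper names are made up for illustration.

```cpp
// Hypothetical stand-in for the slide's Vector4D: four floats, 128 bits.
struct Vector4D { float f[4]; };

struct MyData {
    Vector4D v1;  // 4 floats
    Vector4D v2;  // 4 floats
};

// Member-by-member: two 128-bit copies.
inline void copy_by_member(MyData& dst, const MyData& src) {
    dst.v1 = src.v1;
    dst.v2 = src.v2;
}

// Whole struct: eligible to become one 256-bit copy.
inline void copy_whole(MyData& dst, const MyData& src) {
    dst = src;
}

// Element-wise comparison helper for checking the result.
inline bool same(const MyData& a, const MyData& b) {
    for (int i = 0; i < 4; ++i) {
        if (a.v1.f[i] != b.v1.f[i] || a.v2.f[i] != b.v2.f[i]) return false;
    }
    return true;
}
```

Both functions leave `dst` with identical contents, so the choice between them is purely a matter of generated code. How wide the stores actually are depends on the compiler and the /arch setting, which is why the difference only shows up in a profiler or in the disassembly.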
bugs deathly potholes
    void func1() {
        for (int i = 0; i<10000; i++)
            func2();
    }

    void func2() {
        //... unrelated stuff...
        func3();
        //... unrelated stuff...
        x = y; // 256-bit copy
    }

    void func3() {
        //... unrelated stuff...
        ... = x.v1; // 128-bit load from x
    }

        vmovups YMMWORD PTR [rbx], ymm0
        mov     rcx, QWORD PTR __$ArrayPad$[rsp]
        xor     rcx, rsp
        call    __security_check_cookie
        add     rsp, 80                 ; 00000050H
        pop     rbx
        ret     0

        push    rbx
        sub     rsp, 80                 ; 00000050H
        mov     rax, QWORD PTR __security_cookie
        xor     rax, rsp
        mov     QWORD PTR __$ArrayPad$[rsp], rax
        mov     rbx, r8
        mov     r8, rdx
        mov     rdx, rcx
        lea     rcx, QWORD PTR $T1[rsp]

        mov     rax, rsp
        mov     QWORD PTR [rax+8], rbx
        mov     QWORD PTR [rax+16], rsi
        push    rdi
        sub     rsp, 144                ; 00000090H
        vmovaps XMMWORD PTR [rax-24], xmm6
        vmovaps XMMWORD PTR [rax-40], xmm7
        vmovaps XMMWORD PTR [rax-56], xmm8
        mov     rsi, r8
        mov     rdi, rdx
        mov     rbx, rcx
        vmovaps XMMWORD PTR [rax-72], xmm9
        vmovaps XMMWORD PTR [rax-88], xmm10
        vmovaps XMMWORD PTR [rax-104], xmm11
        vmovaps XMMWORD PTR [rax-120], xmm12
        vmovdqu xmm12, XMMWORD PTR
        test    cl, 15
        je      SHORT
        lea     rdx, OFFSET
        lea     rcx, OFFSET
        mov     r8d, 78                 ; 0000004eH
        call    _wassert
        vmovupd xmm11, XMMWORD PTR [rsi]
        vmovupd xmm10, XMMWORD PTR [rsi+16]
The performance landscape is changing. Get to know your profiler.
Profile your code