Cool one: Fused multiply accumulate (FMA). Using FMA everywhere hurts performance.

//... stuff...
x[0] = y[0]; // 128b copy
x[1] = y[1]; // 128b copy
//... stuff...

"optimized":
//... stuff...
x = y; // 256b copy
//... stuff...

This may cause huge slowdowns on some chips

Intel Pentium 3 (1999), AMD Athlon XP (2001): Some 128 bit SIMD instructions; /arch:SSE; Visual C++ ?
Intel Pentium 4 (2001), AMD Athlon 64 (2003): 128 bit SIMD instructions; /arch:SSE2; Visual Studio .NET 2003
Intel Sandy Bridge (2011), AMD Bulldozer (2011): FP 256 bit SIMD instructions; /arch:AVX; Visual Studio 2010
Intel Haswell (2013), Future AMD Chip (?): 256 bit SIMD instructions; /arch:AVX2; Visual Studio 2013 Update 2 (optimization support). New hotness!

FMA intrinsics (/arch:AVX2):
Scalar float: _mm_fmadd_ss, _mm_fmsub_ss, _mm_fnmadd_ss, _mm_fnmsub_ss
Scalar double: _mm_fmadd_sd, _mm_fmsub_sd, _mm_fnmadd_sd, _mm_fnmsub_sd
128-bit packed float: _mm_fmadd_ps, _mm_fmsub_ps, _mm_fnmadd_ps, _mm_fnmsub_ps
128-bit packed double: _mm_fmadd_pd, _mm_fmsub_pd, _mm_fnmadd_pd, _mm_fnmsub_pd
256-bit packed float: _mm256_fmadd_ps, _mm256_fmsub_ps, _mm256_fnmadd_ps, _mm256_fnmsub_ps
256-bit packed double: _mm256_fmadd_pd, _mm256_fmsub_pd, _mm256_fnmadd_pd, _mm256_fnmsub_pd
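As a rough illustration of what a fused multiply-add computes, the sketch below uses the standard `std::fma` from `<cmath>` as a portable stand-in for the intrinsics listed above. It is not the intrinsic API itself, and `dot_fma` is a hypothetical helper name. Whether the compiler turns `std::fma` into a hardware FMA instruction depends on the target and compiler flags.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: dot product where each step is one fused multiply-add.
// std::fma(a, b, c) computes a*b + c with a single rounding at the end.
float dot_fma(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float dp = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dp = std::fma(a[i], b[i], dp);  // dp = a[i]*b[i] + dp
    }
    return dp;
}
```

Note that every step reads the `dp` produced by the previous step, so the FMAs in this loop form one serial chain.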

Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles
res = A*B + C
Multiply, then add: 5 cycles + 3 cycles
FMA: 5 cycles

Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles
res = A*B + C*D
Two multiplies side by side, then add: 5 cycles + 3 cycles
Multiply C*D, then FMA with A and B: 5 cycles + 5 cycles

Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles
Dot product: dp accumulates ... A[5]*B[5], A[6]*B[6] ... (temporaries t1, t2)
[Diagram: dependency chains for the dot product, labeled 5 cycles, 3 cycles, 5 cycles, 3 cycles]
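The latency arithmetic behind these slides can be sketched as a small critical-path model. The three latencies are the ones quoted on the slides. Everything else is an assumption made for illustration: the function names are hypothetical, and the model is idealized, with independent operations overlapping perfectly and only dependency chains costing time.

```cpp
#include <cassert>

// Latencies quoted on the slides, in cycles.
constexpr int kMul = 5;
constexpr int kAdd = 3;
constexpr int kFma = 5;

// res = A*B + C
constexpr int single_mul_add() { return kMul + kAdd; }  // multiply, then add
constexpr int single_fma()     { return kFma; }         // one fused operation

// res = A*B + C*D
// Separate ops: the two multiplies are independent and overlap, then one add.
constexpr int pair_mul_add() { return kMul + kAdd; }
// FMA: C*D must finish before the FMA that folds in A*B can start.
constexpr int pair_fma()     { return kMul + kFma; }

// dp += A[i]*B[i] over n elements.
// Separate ops: multiplies overlap, so only the adds form the serial chain.
constexpr int dot_chain_mul_add(int n) { return kMul + n * kAdd; }
// FMA: each step waits for the previous dp, so the chain is n FMAs long.
constexpr int dot_chain_fma(int n)     { return n * kFma; }

// Hypothetical self-check of the model.
void check_latency_model() {
    assert(single_mul_add() == 8 && single_fma() == 5);   // FMA wins
    assert(pair_mul_add() == 8 && pair_fma() == 10);      // FMA loses
    assert(dot_chain_mul_add(1000) == 3005);
    assert(dot_chain_fma(1000) == 5000);                  // FMA loses
}
```

Under these assumptions FMA helps for a lone `A*B + C`, but lengthens the critical path once the multiply could have run in parallel with other work.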

Highly optimized CPU code isn’t CPU code.

for (i=0; i<1000; i++)
    A[i] = B[i] + C[i];

autovec (128-bit):
for (i=0; i<1000; i+=4)
    xmm1 = vmovups B[i]
    xmm2 = vaddps xmm1, C[i]
    A[i] = vmovups xmm2

autovec (256-bit):
for (i=0; i<1000; i+=8)
    ymm1 = vmovups B[i]
    ymm2 = vaddps ymm1, C[i]
    A[i] = vmovups ymm2
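For reference, a compilable form of the scalar loop might look like the sketch below. The function name `add_arrays` and the use of raw pointers are assumptions made for illustration. Whether a compiler actually vectorizes it, and at which width, depends on the compiler, its optimization level, and the target option (for example the /arch switches mentioned earlier in the deck).

```cpp
#include <cstddef>

// Hypothetical helper: element-wise A = B + C.
// Each iteration is independent of the others, which is the property
// an auto-vectorizer needs to process 4 or 8 floats per instruction.
void add_arrays(float* A, const float* B, const float* C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        A[i] = B[i] + C[i];
    }
}
```

Inspecting the generated assembly is the reliable way to confirm which instructions were emitted.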

32-bit float scalar: Total 100 ms; CPU 80 ms; Mem 20 ms
128-bit SIMD: Total 40 ms; CPU 20 ms; Mem 20 ms (2.5x speedup)
256-bit SIMD: Total 30 ms; CPU 10 ms; Mem 20 ms (1.3x speedup); Memory Bound

Highly optimized CPU code isn’t CPU code.
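The numbers on this slide follow from a simple model, sketched below under one assumption: widening the SIMD registers divides only the CPU portion of the time and leaves the memory portion untouched. The type and function names are hypothetical.

```cpp
#include <cassert>

// Time split for one run of the loop, in milliseconds.
struct Timing {
    double cpu_ms;
    double mem_ms;
};

inline double total_ms(Timing t) { return t.cpu_ms + t.mem_ms; }

// Assumption: processing `lanes` floats per instruction divides CPU time
// by `lanes`, while time spent waiting on memory stays the same.
inline Timing widen(Timing scalar, double lanes) {
    return Timing{scalar.cpu_ms / lanes, scalar.mem_ms};
}

inline double speedup(Timing before, Timing after) {
    return total_ms(before) / total_ms(after);
}
```

With the slide's scalar baseline of 80 ms CPU and 20 ms memory, `widen(base, 4)` totals 40 ms and `widen(base, 8)` totals 30 ms. Doubling the register width again buys only about 1.3x because the fixed 20 ms of memory time now dominates.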

Windows task manager won’t help you here

Courtesy of http://eigen.tuxfamily.org/

[Chart: timings of 8.5 ms, 6.4 ms, and 10 ms, annotated "enh", "yay", and "this sucks"]

struct MyData {
    Vector4D v1; // 4 floats
    Vector4D v2; // 4 floats
};
MyData x;
MyData y;

void func2() {
    //... unrelated stuff...
    func3();
    //... unrelated stuff...
    x.v1 = y.v1; // 128-bit copy
    x.v2 = y.v2; // 128-bit copy
}

The two 128-bit copies were replaced by a single copy:
    x = y; // 256-bit copy

This caused the 60% slowdown on Haswell
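To make the two variants concrete, here is a minimal sketch. The layout of `Vector4D` is an assumption (the slide only says "4 floats"), and the helper names are hypothetical. The sketch shows only that both forms leave identical bytes in the destination. It does not reproduce the slowdown, and the instruction width the compiler picks for each copy depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstring>

// Assumption: Vector4D is four packed floats, 16-byte aligned.
struct alignas(16) Vector4D {
    float f[4];
};

struct MyData {
    Vector4D v1;  // 4 floats
    Vector4D v2;  // 4 floats
};
static_assert(sizeof(MyData) == 32, "two 128-bit halves, 256 bits in total");

// Variant 1: two member-wise copies (128 bits each).
inline void copy_halves(MyData& dst, const MyData& src) {
    dst.v1 = src.v1;
    dst.v2 = src.v2;
}

// Variant 2: one whole-struct copy (256 bits).
inline void copy_whole(MyData& dst, const MyData& src) {
    dst = src;
}

// Hypothetical check: both variants produce the same bytes.
inline bool copies_agree() {
    MyData src{{{1.0f, 2.0f, 3.0f, 4.0f}}, {{5.0f, 6.0f, 7.0f, 8.0f}}};
    MyData a{};
    MyData b{};
    copy_halves(a, src);
    copy_whole(b, src);
    return std::memcmp(&a, &b, sizeof(MyData)) == 0;
}
```

Because the two forms are equivalent at the source level, the only way to tell them apart is to measure them on the target chip.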

bugs deathly potholes

void func1() {
    for (int i = 0; i<10000; i++)
        func2();
}

void func2() {
    //... unrelated stuff...
    func3();
    //... unrelated stuff...
    x = y; // 256-bit copy
}

void func3() {
    //... unrelated stuff...
    ... = x.v1; // 128-bit load from x
}

vmovups YMMWORD PTR [rbx], ymm0
mov rcx, QWORD PTR __$ArrayPad$[rsp]
xor rcx, rsp
call __security_check_cookie
add rsp, 80 ; 00000050H
pop rbx
ret 0

push rbx
sub rsp, 80 ; 00000050H
mov rax, QWORD PTR __security_cookie
xor rax, rsp
mov QWORD PTR __$ArrayPad$[rsp], rax
mov rbx, r8
mov r8, rdx
mov rdx, rcx
lea rcx, QWORD PTR $T1[rsp]

mov rax, rsp
mov QWORD PTR [rax+8], rbx
mov QWORD PTR [rax+16], rsi
push rdi
sub rsp, 144 ; 00000090H
vmovaps XMMWORD PTR [rax-24], xmm6
vmovaps XMMWORD PTR [rax-40], xmm7
vmovaps XMMWORD PTR [rax-56], xmm8
mov rsi, r8
mov rdi, rdx
mov rbx, rcx
vmovaps XMMWORD PTR [rax-72], xmm9
vmovaps XMMWORD PTR [rax-88], xmm10
vmovaps XMMWORD PTR [rax-104], xmm11
vmovaps XMMWORD PTR [rax-120], xmm12
vmovdqu xmm12, XMMWORD PTR __xmm@0000000000000000
test cl, 15
je SHORT $LN14@run
lea rdx, OFFSET FLAT:??_C@_1FM@KGHGDLJC@
lea rcx, OFFSET FLAT:??_C@_1BIM@JPMPBING@
mov r8d, 78 ; 0000004eH
call _wassert
$LN14@run:
vmovupd xmm11, XMMWORD PTR [rsi]
vmovupd xmm10, XMMWORD PTR [rsi+16]

The performance landscape is changing. Get to know your profiler.



Profile your code
