Using FMA everywhere hurts performance / Cool one: Fused multiply accumulate (FMA)



    // ... stuff ...
    x[0] = y[0];   // 128b copy
    x[1] = y[1];   // 128b copy
    // ... stuff ...

"optimized" version:

    // ... stuff ...
    x = y;         // 256b copy
    // ... stuff ...

This may cause huge slowdowns on some chips.

What?

Intel Pentium 3 (1999), AMD Athlon XP (2001):    some 128-bit SIMD instructions   /arch:SSE    Visual C++ ?
Intel Pentium 4 (2001), AMD Athlon 64 (2003):    128-bit SIMD instructions        /arch:SSE2   Visual Studio .NET 2003
Intel Sandy Bridge (2011), AMD Bulldozer (2011): FP 256-bit SIMD instructions     /arch:AVX    Visual Studio 2010
Intel Haswell (2013), future AMD chip (?):       256-bit SIMD instructions        /arch:AVX2   Visual Studio 2013 Update 2 (optimization support) (new hotness!)
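
Not from the slides, but to connect the /arch switches above to source code: a minimal sketch that reports, at compile time, which instruction-set level a translation unit targets. It relies on predefined macros that MSVC documents (__AVX2__, __AVX__, _M_IX86_FP, _M_X64); GCC and Clang define the __AVX__ and __SSE2__ macros as well.

    // Sketch only: reports the highest SIMD level this file was compiled for.
    const char* target_simd_level() {
    #if defined(__AVX2__)
        return "AVX2: 256-bit SIMD + FMA (/arch:AVX2)";
    #elif defined(__AVX__)
        return "AVX: 256-bit FP SIMD (/arch:AVX)";
    #elif defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2) || defined(__SSE2__)
        return "SSE2: 128-bit SIMD (/arch:SSE2 or x64 default)";
    #else
        return "SSE or lower";
    #endif
    }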

FMA intrinsics (/arch:AVX2):

    _mm_fmadd_ss,    _mm_fmsub_ss,    _mm_fnmadd_ss,    _mm_fnmsub_ss,
    _mm_fmadd_sd,    _mm_fmsub_sd,    _mm_fnmadd_sd,    _mm_fnmsub_sd,
    _mm_fmadd_ps,    _mm_fmsub_ps,    _mm_fnmadd_ps,    _mm_fnmsub_ps,
    _mm_fmadd_pd,    _mm_fmsub_pd,    _mm_fnmadd_pd,    _mm_fnmsub_pd,
    _mm256_fmadd_ps, _mm256_fmsub_ps, _mm256_fnmadd_ps, _mm256_fnmsub_ps,
    _mm256_fmadd_pd, _mm256_fmsub_pd, _mm256_fnmadd_pd, _mm256_fnmsub_pd
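
As a quick illustration (the wrapper function and its names are mine, not from the slides): _mm256_fmadd_ps multiplies the eight float lanes of a and b and adds c, with a single rounding step, in one instruction.

    #include <immintrin.h>

    // res = a * b + c for 8 floats at once, using the fused multiply-add.
    __m256 madd8(__m256 a, __m256 b, __m256 c) {
        return _mm256_fmadd_ps(a, b, c);
    }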

Diagram: computing res = A*B + C. A multiply (5 cycles) feeding an add (3 cycles) takes 8 cycles on the critical path; a single FMA takes 5 cycles. (Latencies assumed on the slide: Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles.)

Diagram: computing res = A*B + C*D. With separate instructions, the two multiplies are independent and run in parallel (5 cycles), then feed one add (3 cycles): about 8 cycles on the critical path. With FMA, the fused operation has to wait for the first product, so the chain is a multiply (5 cycles) followed by an FMA (5 cycles): about 10 cycles. (Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles.)
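
A minimal scalar sketch of the same trade-off (std::fma and the function names stand in for the vector code on the slide); both functions compute A*B + C*D, but the dependency chains the CPU sees differ:

    #include <cmath>

    // Two independent multiplies overlap, then one add:
    // critical path ~ mul + add latency (5 + 3 cycles on the slide).
    double mul_then_add(double A, double B, double C, double D) {
        double t1 = A * B;
        double t2 = C * D;          // independent of t1
        return t1 + t2;
    }

    // The FMA must wait for the first product:
    // critical path ~ mul + FMA latency (5 + 5 cycles on the slide).
    double fma_chain(double A, double B, double C, double D) {
        double t = A * B;
        return std::fma(C, D, t);   // C*D + t, fused, single rounding
    }

Note that with floating-point contraction enabled, a compiler may itself rewrite the first version into the second, which is exactly the behavior the title warns about.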

Diagram: a dot-product loop, dp += A[i]*B[i]. With FMA everywhere, each iteration's FMA depends on the previous value of dp, so the loop-carried chain costs 5 cycles per element. With separate instructions, the multiplies are independent of dp and can overlap across iterations; only the 3-cycle adds sit on the dp chain. (Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles.)
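
The same effect in loop form, again as a sketch (the function names and the double accumulator are mine): both loops compute the same dot product, but only the first puts a full FMA latency on the loop-carried dp chain.

    #include <cmath>
    #include <cstddef>

    // FMA everywhere: each fma depends on the previous dp,
    // so the chain costs ~5 cycles per element.
    double dot_fma(const double* A, const double* B, std::size_t n) {
        double dp = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            dp = std::fma(A[i], B[i], dp);
        return dp;
    }

    // Separate multiply and add: the multiplies don't touch dp and can
    // overlap across iterations; only the ~3-cycle adds sit on the dp chain.
    double dot_mul_add(const double* A, const double* B, std::size_t n) {
        double dp = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            double t = A[i] * B[i];   // independent of dp
            dp += t;
        }
        return dp;
    }

The results can also differ in the last bit, since the fused form rounds once per element instead of twice.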

Highly optimized CPU code isn’t CPU code.

Source loop:

    for (i = 0; i < 1000; i++)
        A[i] = B[i] + C[i];

Auto-vectorized with 128-bit vectors (4 floats per iteration):

    for (i = 0; i < 1000; i += 4)
        xmm1 = vmovups B[i]
        xmm2 = vaddps  xmm1, C[i]
        A[i] = vmovups xmm2

Auto-vectorized with 256-bit vectors (8 floats per iteration):

    for (i = 0; i < 1000; i += 8)
        ymm1 = vmovups B[i]
        ymm2 = vaddps  ymm1, C[i]
        A[i] = vmovups ymm2
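
For reference, a hand-written equivalent of the 256-bit version using AVX intrinsics (the function name and the scalar remainder loop are mine; with a trip count of 1000 the remainder loop never runs):

    #include <immintrin.h>

    // A[i] = B[i] + C[i], 8 floats per iteration. Needs a CPU with AVX and a
    // build with /arch:AVX or higher.
    void add_arrays(float* A, const float* B, const float* C, int n) {
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 b = _mm256_loadu_ps(&B[i]);              // vmovups
            __m256 c = _mm256_loadu_ps(&C[i]);              // vmovups
            _mm256_storeu_ps(&A[i], _mm256_add_ps(b, c));   // vaddps + vmovups
        }
        for (; i < n; ++i)   // remainder, for n not a multiple of 8
            A[i] = B[i] + C[i];
    }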

                32-bit float scalar   128-bit SIMD   256-bit SIMD
    Total:      100 ms                40 ms          30 ms
    CPU:        80 ms                 20 ms          10 ms
    Mem:        20 ms                 20 ms          20 ms

Scalar to 128-bit SIMD: 2.5x speedup. 128-bit to 256-bit SIMD: only a 1.3x speedup, because the 20 ms of memory time never shrinks. The code is memory bound. Highly optimized CPU code isn't CPU code.

Windows Task Manager won't help you here.


Chart: 8.5 ms (enh), 6.4 ms (yay), 10 ms (this sucks).

    struct MyData {
        Vector4D v1;   // 4 floats
        Vector4D v2;   // 4 floats
    };

    MyData x;
    MyData y;

    void func2() {
        // ... unrelated stuff ...
        func3();
        // ... unrelated stuff ...
        x.v1 = y.v1;   // 128-bit copy
        x.v2 = y.v2;   // 128-bit copy
    }

The two 128-bit copies become a single copy:

    x = y;   // 256-bit copy

This caused the 60% slowdown on Haswell.
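
Not from the slides, but one possible way to keep the copies at 128 bits is to write them with explicit SSE intrinsics. Vector4D and MyData below are stand-ins for the types on the slide, and whether a given compiler still merges the two stores is not guaranteed, so the result has to be measured:

    #include <immintrin.h>

    struct Vector4D { float f[4]; };        // stand-in: 4 floats, 16 bytes
    struct MyData   { Vector4D v1, v2; };

    MyData x, y;

    // Copy y into x as two explicit 128-bit moves rather than one 256-bit copy.
    void copy_mydata() {
        _mm_storeu_ps(x.v1.f, _mm_loadu_ps(y.v1.f));   // 128-bit copy of v1
        _mm_storeu_ps(x.v2.f, _mm_loadu_ps(y.v2.f));   // 128-bit copy of v2
    }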

Diagram: "bugs" / "deathly potholes".

Source:

    void func1() {
        for (int i = 0; i < 10000; i++)
            func2();
    }

    void func2() {
        // ... unrelated stuff ...
        func3();
        // ... unrelated stuff ...
        x = y;   // 256-bit copy
    }

    void func3() {
        // ... unrelated stuff ...
        ... = x.v1;   // 128-bit load from x
    }

Generated assembly:

    vmovups YMMWORD PTR [rbx], ymm0
    mov     rcx, QWORD PTR __$ArrayPad$[rsp]
    xor     rcx, rsp
    call    __security_check_cookie
    add     rsp, 80                         ; 00000050H
    pop     rbx
    ret     0

    push    rbx
    sub     rsp, 80                         ; 00000050H
    mov     rax, QWORD PTR __security_cookie
    xor     rax, rsp
    mov     QWORD PTR __$ArrayPad$[rsp], rax
    mov     rbx, r8
    mov     r8, rdx
    mov     rdx, rcx
    lea     rcx, QWORD PTR $T1[rsp]
    mov     rax, rsp
    mov     QWORD PTR [rax+8], rbx
    mov     QWORD PTR [rax+16], rsi
    push    rdi
    sub     rsp, 144                        ; 00000090H
    vmovaps XMMWORD PTR [rax-24], xmm6
    vmovaps XMMWORD PTR [rax-40], xmm7
    vmovaps XMMWORD PTR [rax-56], xmm8
    mov     rsi, r8
    mov     rdi, rdx
    mov     rbx, rcx
    vmovaps XMMWORD PTR [rax-72], xmm9
    vmovaps XMMWORD PTR [rax-88], xmm10
    vmovaps XMMWORD PTR [rax-104], xmm11
    vmovaps XMMWORD PTR [rax-120], xmm12
    vmovdqu xmm12, XMMWORD PTR
    test    cl, 15
    je      SHORT
    lea     rdx, OFFSET
    lea     rcx, OFFSET
    mov     r8d, 78                         ; 0000004eH
    call    _wassert
    vmovupd xmm11, XMMWORD PTR [rsi]
    vmovupd xmm10, XMMWORD PTR [rsi+16]

The performance landscape is changing. Get to know your profiler.



Profile your code