BLIS Matrix Multiplication: from Real to Complex
Field G. Van Zee

Acknowledgements
Funding
- NSF Award OCI: SI2-SSI: A Linear Algebra Software Infrastructure for Sustained Innovation in Computational Chemistry and other Sciences. (Funded June 1, May 31, 2015.)
- Other sources (Intel, Texas Instruments)
Collaborators
- Tyler Smith, Tze Meng Low

Acknowledgements
Journal papers
- “BLIS: A Framework for Rapid Instantiation of BLAS Functionality” (accepted to TOMS)
- “The BLIS Framework: Experiments in Portability” (accepted to TOMS pending minor modifications)
- “Analytical Modeling is Enough for High Performance BLIS” (submitted to TOMS)
Conference papers
- “Anatomy of High-Performance Many-Threaded Matrix Multiplication” (accepted to IPDPS 2014)

Introduction
Before we get started…
- Let’s review the general matrix-matrix multiplication (gemm) as implemented by Kazushige Goto in GotoBLAS. [Goto and van de Geijn 2008]

The gemm algorithm
[Diagram sequence: C += A·B is partitioned step by step. The n dimension is split into column panels of width NC, then the k dimension into panels of depth KC; the resulting KC × NC row panel of B is packed into micro-panels of width NR. The m dimension is then split into blocks of height MC, and each MC × KC block of A is packed into micro-panels of height MR.]

Where the micro-kernel fits in
Macro-kernel consists of three loops (cache block sizes):
- NC dimension
- MC dimension
- KC dimension

[Diagram sequence: the loop nest is refined until the NC and MC loops advance in steps of NR and MR, respectively.]

for ( 0 to NC-1: NR )
  for ( 0 to MC-1: MR )
    for ( 0 to KC-1 )
      // outer product
    endfor
  endfor
endfor
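
To make the structure concrete, here is a minimal plain-C sketch of these loops together with the packing described earlier. It is illustrative only, not the BLIS or GotoBLAS source: the block sizes are made-up placeholder values, storage is assumed column-major, m and n are assumed to be multiples of MR and NR, and the micro-kernel is a deliberately naive triple loop.

/* Hypothetical block sizes, chosen only for illustration. */
#include <stdlib.h>

enum { NC = 4096, KC = 256, MC = 96, MR = 4, NR = 4 };

/* Naive micro-kernel: C(MR x NR) += A~(MR x kc) * B~(kc x NR),
   where A~ and B~ are packed micro-panels and C is column-major. */
static void ukernel(int kc, const double *a, const double *b, double *c, int ldc)
{
    for (int k = 0; k < kc; ++k)                 /* one rank-1 update per k */
        for (int j = 0; j < NR; ++j)
            for (int i = 0; i < MR; ++i)
                c[i + j*ldc] += a[k*MR + i] * b[k*NR + j];
}

/* C += A * B, all column-major; assumes m % MR == 0 and n % NR == 0. */
void gemm_blocked(int m, int n, int k,
                  const double *A, int lda,
                  const double *B, int ldb,
                  double *C, int ldc)
{
    double *Bp = malloc(sizeof(double) * KC * NC);   /* packed row panel of B */
    double *Ap = malloc(sizeof(double) * MC * KC);   /* packed block of A     */

    for (int jc = 0; jc < n; jc += NC)               /* partition n by NC     */
        for (int pc = 0; pc < k; pc += KC) {         /* partition k by KC     */
            int nc = (n - jc < NC) ? n - jc : NC;
            int kc = (k - pc < KC) ? k - pc : KC;

            /* Pack the kc x nc row panel of B into width-NR micro-panels. */
            for (int jr = 0; jr < nc; jr += NR)
                for (int p = 0; p < kc; ++p)
                    for (int j = 0; j < NR; ++j)
                        Bp[jr*kc + p*NR + j] = B[(pc + p) + (jc + jr + j)*ldb];

            for (int ic = 0; ic < m; ic += MC) {     /* partition m by MC     */
                int mc = (m - ic < MC) ? m - ic : MC;

                /* Pack the mc x kc block of A into height-MR micro-panels. */
                for (int ir = 0; ir < mc; ir += MR)
                    for (int p = 0; p < kc; ++p)
                        for (int i = 0; i < MR; ++i)
                            Ap[ir*kc + p*MR + i] = A[(ic + ir + i) + (pc + p)*lda];

                /* Macro-kernel: loop over NR- and MR-sized micro-panels. */
                for (int jr = 0; jr < nc; jr += NR)
                    for (int ir = 0; ir < mc; ir += MR)
                        ukernel(kc, &Ap[ir*kc], &Bp[jr*kc],
                                &C[(ic + ir) + (jc + jr)*ldc], ldc);
            }
        }
    free(Ap);
    free(Bp);
}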

The gemm micro-kernel
[Diagram: an MR × NR micro-tile of C is updated by an MR × KC micro-panel of packed A times a KC × NR micro-panel of packed B, one outer product per iteration of the KC loop; the γ values of the micro-tile accumulate products of the α and β elements.]
Typical micro-kernel loop iteration:
- Load column of packed A
- Load row of packed B
- Compute outer product
- Update C (kept in registers)
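
A scalar sketch of that loop body (again illustrative, not an actual BLIS micro-kernel): with MR = NR = 4 assumed, the 4 × 4 micro-tile is accumulated in local variables, standing in for registers, and C is updated only once after the KC loop — the key difference from the naive micro-kernel in the earlier sketch.

static void ukernel_4x4(int kc, const double *a, const double *b,
                        double *c, int ldc)
{
    double g[4][4] = { { 0.0 } };        /* micro-tile accumulators ("registers") */

    for (int p = 0; p < kc; ++p) {       /* one rank-1 update per iteration       */
        const double *ap = a + 4*p;      /* load column of packed A               */
        const double *bp = b + 4*p;      /* load row of packed B                  */
        for (int j = 0; j < 4; ++j)      /* compute outer product                 */
            for (int i = 0; i < 4; ++i)
                g[j][i] += ap[i] * bp[j];
    }
    for (int j = 0; j < 4; ++j)          /* update C once, after the loop         */
        for (int i = 0; i < 4; ++i)
            c[i + j*ldc] += g[j][i];
}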

From real to complex
HPC community focuses on real domain. Why?
- Prevalence of real domain applications
- Benchmarks
- Complex domain has unique challenges

Challenges
- Programmability
- Floating-point latency / register set size
- Instruction set

Programmability
What do you mean?
- Programmability of the BLIS micro-kernel
- The micro-kernel typically must be implemented in assembly language
Ugh. Why assembly?
- Compilers have trouble using vector instructions efficiently
- Even using vector intrinsics tends to leave flops on the table

Programmability
Okay fine, I’ll write my micro-kernel in assembly. It can’t be that bad, right?
- I could show you actual assembly code, but…
- This is supposed to be a retreat!
- Diagrams are more illustrative anyway

Programmability
Diagrams will depict a rank-1 update. Why?
- It’s the body of the micro-kernel’s loop!
Instruction set
- Similar to Xeon Phi
Notation
- α, β, γ are elements of matrices A, B, C, respectively
Let’s begin with the real domain

Real rank-1 update in assembly
[Diagram: a column α0–α3 of packed A is LOADed, each β of the packed row of B is BROADCAST, and the products are MULtiplied and ADDed into the 4 × 4 micro-tile of γ values.]
- 4 elements per vector register
- Implements a 4 × 4 rank-1 update
- α0–α3, β0–β3 are real elements
- Load/swizzle instructions required: LOAD, BROADCAST
- Floating-point instructions required: MULTIPLY, ADD
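
A hedged sketch of one such iteration using AVX intrinsics (the slide assumes a Xeon-Phi-like instruction set, so the exact instruction names differ; this is illustrative, not the actual BLIS kernel):

#include <immintrin.h>

/* One real rank-1 update: 4 x 4, double precision, 4 elements per register. */
static inline void rank1_real_4x4(const double *a, const double *b, __m256d c[4])
{
    __m256d av = _mm256_loadu_pd(a);                    /* LOAD  column of packed A */
    for (int j = 0; j < 4; ++j) {
        __m256d bv = _mm256_broadcast_sd(&b[j]);        /* BCAST one beta of B      */
        c[j] = _mm256_add_pd(c[j],                      /* ADD into the micro-tile  */
                             _mm256_mul_pd(av, bv));    /* MUL                      */
    }
}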

Complex rank-1 update in assembly
[Diagram: the packed complex column of A and row of B are rearranged with DUPLICATE, SHUFFLE, and PERMUTE, MULtiplied, and accumulated into the 2 × 2 micro-tile of γ values with ADD and SUBADD.]
- 4 elements per vector register
- Implements a 2 × 2 rank-1 update
- α0 + iα1, α2 + iα3, β0 + iβ1, β2 + iβ3 are complex elements
- Load/swizzle instructions required: LOAD, DUPLICATE, SHUFFLE (within “lanes”), PERMUTE (across “lanes”)
- Floating-point instructions required: MULTIPLY, ADD, SUBADD
- High values in the micro-tile still need to be swapped (after the loop)
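
For comparison, a hedged AVX-intrinsics sketch of one complex iteration. It uses a broadcast-plus-ADDSUB formulation rather than the DUPLICATE/SHUFFLE/PERMUTE sequence on the slide, but it makes the same point: extra swizzle work, extra temporaries, and an ADDSUB-style instruction are needed beyond plain MULTIPLY/ADD.

#include <immintrin.h>

/* One complex rank-1 update: 2 x 2 double-complex elements, stored interleaved
   (re, im), 4 doubles per register; c[n] accumulates column n of the micro-tile. */
static inline void rank1_cplx_2x2(const double *a, const double *b, __m256d c[2])
{
    __m256d av  = _mm256_loadu_pd(a);             /* LOAD [a0.re a0.im a1.re a1.im] */
    __m256d avs = _mm256_permute_pd(av, 0x5);     /* SHUFFLE re/im within each lane */

    for (int n = 0; n < 2; ++n) {
        __m256d br = _mm256_broadcast_sd(&b[2*n]);       /* re(beta_n) everywhere */
        __m256d bi = _mm256_broadcast_sd(&b[2*n + 1]);   /* im(beta_n) everywhere */
        /* [re: ar*br - ai*bi, im: ai*br + ar*bi] for both rows of column n */
        __m256d t = _mm256_addsub_pd(_mm256_mul_pd(av,  br),
                                     _mm256_mul_pd(avs, bi));
        c[n] = _mm256_add_pd(c[n], t);            /* ADD into the micro-tile */
    }
}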

Programmability
Bottom line
- Expressing complex arithmetic in assembly is:
  - Awkward (at best)
  - Tedious (potentially error-prone)
  - Possibly not even possible if instructions are missing!
  - Or possible, but at lower performance (flop rate)
- Assembly-coding the real domain isn’t looking so bad now, is it?

Challenges
- Programmability
- Floating-point latency / register set size
- Instruction set

Latency / register set size
Complex rank-1 update needs extra registers to hold intermediate results from “swizzle” instructions
- But that’s okay! I can just reduce MR × NR (the micro-tile footprint) because complex does four times as many flops!
- Not quite: four times the flops on twice the data
- Hrrrumph. Okay fine, twice as many flops per byte

Latency / register set size
Actually, this two-fold flops-per-byte advantage for complex buys you nothing
- Wait, what? Why?

Latency / register set size
What happened to my extra flops!?
- They’re still there, but there is a problem…
- Each element γ must be updated TWICE
- Complex rank-1 update = TWO real rank-1 updates
- Each update of γ still requires a full latency period
- Yes, we get to execute twice as many flops, but we are forced to spend twice as long executing them!
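
A scalar sketch of the same point (illustrative types and names, not BLIS code): for a single micro-tile entry γ += α·β in the complex domain, both the real and the imaginary accumulator are written twice per iteration, so each γ sits at the end of two dependent multiply-add chains and still pays a full latency period per update.

typedef struct { double re, im; } dcomplex;

/* gamma += alpha * beta: each part of gamma is written twice per iteration,
   i.e., one complex rank-1 update behaves like two real rank-1 updates. */
static inline void cplx_update(dcomplex *gamma, dcomplex alpha, dcomplex beta)
{
    gamma->re += alpha.re * beta.re;   /* first  "real" rank-1 update */
    gamma->im += alpha.im * beta.re;
    gamma->re -= alpha.im * beta.im;   /* second "real" rank-1 update */
    gamma->im += alpha.re * beta.im;
}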

Latency / register set size
So I have to keep MR × NR the same?
- Probably, yes (in bytes)
And I still have to find registers to swizzle?
- Yes
RvdG: “This is why I like to live my life as a double.”

Challenges
- Programmability
- Floating-point latency / register set size
- Instruction set

Instruction set
Need more sophisticated swizzle instructions:
- DUPLICATE (in pairs)
- SHUFFLE (within lanes)
- PERMUTE (across lanes)
And floating-point instructions:
- SUBADD (subtract/add every other element)

Instruction set
The number of operands addressed by the instruction set also matters
- Three is better than two (AVX vs. SSE). Why?
- A two-operand MULTIPLY must overwrite one input operand
  - What if you need to reuse that operand? You have to make a copy
  - Copying increases the effective latency of the floating-point instruction
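
A small illustration of the operand-count point (the register names and instruction mnemonics in the comments are illustrative): with a destructive two-operand multiply, reusing the loaded column of A for a second column of C forces a register-to-register copy first; a three-operand multiply leaves its inputs intact.

#include <immintrin.h>

static inline void update_two_columns(__m256d av, __m256d b0, __m256d b1, __m256d c[2])
{
    /* Three-operand form (AVX-style "vmulpd t, av, b"): av survives both uses. */
    __m256d t0 = _mm256_mul_pd(av, b0);
    __m256d t1 = _mm256_mul_pd(av, b1);

    /* A two-operand ISA (SSE-style "mulpd x, y" overwrites x) would instead need:
         movapd t0, av   ; copy, because the multiply destroys its target
         mulpd  t0, b0
         movapd t1, av   ; and another copy for the second reuse
         mulpd  t1, b1
       The extra copies lengthen the effective latency of each multiply. */
    c[0] = _mm256_add_pd(c[0], t0);
    c[1] = _mm256_add_pd(c[1], t1);
}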

Let’s be friends!
So what are the properties of complex-friendly hardware?
- Low latency (e.g. MULTIPLY/ADD instructions)
- Lots of vector registers
- Floating-point instructions with built-in swizzle
  - Frees intermediate register for other purposes
  - May shorten latency
- Instructions that perform complex arithmetic (COMPLEX MULTIPLY / COMPLEX ADD)

Complex-friendly hardware
Unfortunately, all of these issues must be taken into account during hardware design.
Either the hardware avoids the complex “performance hazard”, or it does not.
There is nothing the kernel programmer can do (except maybe befriend/bribe a hardware architect and wait 3–5 years).

Summary
- Complex matrix multiplication (and all level-3 BLAS-like operations) relies on a complex micro-kernel
- Complex micro-kernels, like their real counterparts, must be written in assembly language to achieve high performance
- The extra flops associated with complex do not make it any easier to write high-performance complex micro-kernels
- Coding complex arithmetic in assembly is demonstrably more difficult than real arithmetic:
  - Need for careful orchestration of real/imaginary components (i.e. more difficult for humans to think about)
  - Increased demand on the register set
  - Need for more exotic instructions

Final thought
Suppose we had a magic box. You find that when you place a real matrix micro-kernel inside, it is transformed into a complex domain kernel (of the same precision).
The magic box rewards your efforts: this complex kernel achieves a high fraction of the performance (flops per byte) attained by your real kernel.
My question for you is: what fraction would it take for you to never write a complex kernel ever again? (That is, to simply use the magic box.)
- 80%?... 90%? %?
- Remember: the magic box is effortless

Final thought
Put another way, how much would you pay for a magic box if that fraction were always 100%?
What would this kind of productivity be worth to you and your developers?
Think about it!

Further information
- Website:
- Discussion:
- Contact: