Anatomy of a High-Performance Many-Threaded Matrix Multiplication
Tyler M. Smith, Robert A. van de Geijn, Mikhail Smelyanskiy, Jeff Hammond, Field G. Van Zee
Introduction
Shared-memory parallelism for GEMM
Many-threaded architectures require more sophisticated methods of parallelism
We need finer-grained parallelism
We explore the opportunities for parallelism and explain which ones we exploit
Outline
GotoBLAS approach
Opportunities for Parallelism
Many-threaded Results
GotoBLAS Approach
The GEMM operation: C += A B, where C is m x n, A is m x k, and B is k x n
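For reference, the operation itself is just three nested loops. A minimal C version, assuming column-major storage (the name gemm_ref is illustrative, not BLIS's API):

#include <stddef.h>

/* Reference GEMM: C += A * B, column-major.
   C is m x n, A is m x k, B is k x n; ld* are leading dimensions. */
void gemm_ref(size_t m, size_t n, size_t k,
              const double *A, size_t lda,
              const double *B, size_t ldb,
              double *C, size_t ldc) {
    for (size_t j = 0; j < n; j++)
        for (size_t p = 0; p < k; p++)
            for (size_t i = 0; i < m; i++)
                C[i + j * ldc] += A[i + p * lda] * B[p + j * ldb];
}

Everything that follows is about restructuring these loops so the operands fit the memory hierarchy.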
[Figure walk-through: the GotoBLAS blocking of C += A B across the memory hierarchy (main memory, L3 cache, L2 cache, L1 cache, registers), one step per slide]
Partition C and B into column panels of width n_c
Partition the panel of B by k_c; the resulting k_c x n_c block of B is packed (into the L3 cache, where present)
Partition the panel of A by m_c; the resulting m_c x k_c block of A is packed into the L2 cache
Partition the packed block of B into k_c x n_r micro-panels, which stream through the L1 cache
Partition the packed block of A into m_r x k_c micro-panels; the micro-kernel accumulates an m_r x n_r block of C in registers
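Collapsing the walk-through into code gives the familiar five-loop structure. A minimal sketch, assuming column-major storage and placeholder block sizes; the packing of A and B into contiguous buffers is elided, so the comments only note where each operand is meant to reside:

#include <stddef.h>

/* Placeholder block sizes; real values are tuned to the cache sizes. */
enum { NC = 4096, KC = 256, MC = 96, NR = 8, MR = 8 };

static size_t min2(size_t a, size_t b) { return a < b ? a : b; }

/* Five-loop blocked GEMM, C += A * B, column-major.
   Packing into contiguous buffers is elided for brevity. */
void gemm_blocked(size_t m, size_t n, size_t k,
                  const double *A, size_t lda,
                  const double *B, size_t ldb,
                  double *C, size_t ldc) {
    for (size_t jc = 0; jc < n; jc += NC)              /* 5th loop: n_c panels of B and C      */
    for (size_t pc = 0; pc < k; pc += KC)              /* 4th loop: k_c x n_c block of B (L3)  */
    for (size_t ic = 0; ic < m; ic += MC)              /* 3rd loop: m_c x k_c block of A (L2)  */
    for (size_t jr = jc; jr < min2(jc + NC, n); jr += NR)    /* 2nd loop: k_c x n_r micro-panel of B (L1) */
    for (size_t ir = ic; ir < min2(ic + MC, m); ir += MR) {  /* 1st loop: m_r x k_c micro-panel of A      */
        /* Micro-kernel stand-in: update an m_r x n_r block of C (registers). */
        size_t mb = min2(MR, m - ir), nb = min2(NR, n - jr), kb = min2(KC, k - pc);
        for (size_t p = 0; p < kb; p++)
            for (size_t j = 0; j < nb; j++)
                for (size_t i = 0; i < mb; i++)
                    C[(ir + i) + (jr + j) * ldc] +=
                        A[(ir + i) + (pc + p) * lda] * B[(pc + p) + (jr + j) * ldb];
    }
}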
Outline
GotoBLAS approach
Opportunities for Parallelism
Many-threaded Results
3 Loops to Parallelize in GotoBLAS
[Figure: the three parallelizable loops in the GotoBLAS algorithm]
5 Opportunities for Parallelism
[Figure: all five loops around the micro-kernel, each an opportunity for parallelism]
Multiple Levels of Parallelism: the i_r loop
All threads share a micro-panel of B
Each thread has its own micro-panel of A
Fixed number of iterations: m_c / m_r
(A parallelization sketch follows the next slide.)
Multiple Levels of Parallelism: the j_r loop
All threads share the block of A
Each thread has its own micro-panel of B
Fixed number of iterations: n_c / n_r
Good if threads share an L2 cache
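These two fixed-trip-count loops are the easiest to parallelize. A hedged OpenMP sketch, assuming A and B are already packed into micro-panel order and that m_c and n_c are multiples of the placeholder register-block sizes; inner_kernel and its packed layout are illustrative, not BLIS's API:

#include <stddef.h>

enum { MR = 8, NR = 8 };  /* placeholder register-block sizes */

/* Inner kernel over a packed m_c x k_c block of A and k_c x n_c block of B,
   updating the corresponding part of C (column-major, leading dimension ldc).
   Ap holds mc/MR micro-panels of size MR x kc; Bp holds nc/NR micro-panels of
   size kc x NR. Assumes mc % MR == 0 and nc % NR == 0. */
void inner_kernel(size_t mc, size_t nc, size_t kc,
                  const double *Ap, const double *Bp,
                  double *C, size_t ldc) {
    /* Parallelize the j_r loop: each thread gets its own micro-panel of B,
       while all threads share the packed block of A. Moving the pragma down
       to the i_r loop parallelizes that level instead. */
    #pragma omp parallel for
    for (size_t jr = 0; jr < nc; jr += NR)
        for (size_t ir = 0; ir < mc; ir += MR) {          /* i_r loop */
            const double *a = Ap + (ir / MR) * (MR * kc);  /* micro-panel of A */
            const double *b = Bp + (jr / NR) * (kc * NR);  /* micro-panel of B */
            double *c = C + ir + jr * ldc;
            for (size_t p = 0; p < kc; p++)                /* micro-kernel stand-in */
                for (size_t j = 0; j < NR; j++)
                    for (size_t i = 0; i < MR; i++)
                        c[i + j * ldc] += a[p * MR + i] * b[p * NR + j];
        }
}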
Multiple Levels of Parallelism: the i_c loop
All threads share the panel of B
Each thread has its own block of A
Number of iterations is not fixed (it depends on m)
Good if there are multiple L2 caches
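A hedged sketch of this option for one k_c slice, assuming column-major storage and a placeholder block size. Each thread packs its own block of A, the block that would occupy that core's L2 cache; the j_r/i_r loops and micro-kernel are replaced by a plain triple loop to keep the example self-contained:

#include <stdlib.h>
#include <string.h>

enum { MC = 96 };  /* placeholder m_c block size */

/* One k_c slice of C += A * B with the 3rd (i_c) loop parallelized.
   Each thread packs a private m_c x kb block of A; the kb x n panel of B
   is shared by all threads. Column-major storage. */
void ic_loop(size_t m, size_t n, size_t kb,
             const double *A, size_t lda,
             const double *B, size_t ldb,
             double *C, size_t ldc) {
    #pragma omp parallel
    {
        double *Ap = malloc(sizeof *Ap * MC * kb);   /* thread-private block of A */
        #pragma omp for schedule(dynamic)            /* trip count ceil(m/MC) is not fixed */
        for (size_t ic = 0; ic < m; ic += MC) {
            size_t mb = m - ic < MC ? m - ic : MC;
            for (size_t p = 0; p < kb; p++)          /* pack the block of A */
                memcpy(Ap + p * mb, A + ic + p * lda, mb * sizeof *Ap);
            for (size_t j = 0; j < n; j++)           /* stand-in for jr/ir loops */
                for (size_t p = 0; p < kb; p++)
                    for (size_t i = 0; i < mb; i++)
                        C[(ic + i) + j * ldc] += Ap[p * mb + i] * B[p + j * ldb];
        }
        free(Ap);
    }
}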
Multiple Levels of Parallelism: the p_c loop
Each iteration updates the entire matrix C
Iterations of the loop are not independent
Parallelizing it therefore requires a mutex when updating C, or a reduction
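A hedged sketch of the reduction variant: each thread accumulates into a private copy of C, and the copies are summed under a critical section at the end. The price is an extra m x n buffer per thread, one reason this loop is usually an unattractive choice:

#include <stdlib.h>

enum { KC = 256 };  /* placeholder k_c block size */

/* C += A * B with the 4th (p_c) loop parallelized via a reduction.
   Every iteration updates all of C, so iterations are not independent:
   each thread accumulates into a private copy Cp, reduced at the end.
   Column-major storage; the inner loops stand in for the rest of the
   algorithm. */
void pc_loop(size_t m, size_t n, size_t k,
             const double *A, size_t lda,
             const double *B, size_t ldb,
             double *C, size_t ldc) {
    #pragma omp parallel
    {
        double *Cp = calloc(m * n, sizeof *Cp);  /* thread-private copy of C */
        #pragma omp for
        for (size_t pc = 0; pc < k; pc += KC) {
            size_t kb = k - pc < KC ? k - pc : KC;
            for (size_t j = 0; j < n; j++)
                for (size_t p = 0; p < kb; p++)
                    for (size_t i = 0; i < m; i++)
                        Cp[i + j * m] += A[i + (pc + p) * lda]
                                       * B[(pc + p) + j * ldb];
        }
        #pragma omp critical                     /* the reduction step */
        for (size_t j = 0; j < n; j++)
            for (size_t i = 0; i < m; i++)
                C[i + j * ldc] += Cp[i + j * m];
        free(Cp);
    }
}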
Multiple Levels of Parallelism: the j_c loop
All threads share the matrix A
Each thread has its own panel of B
Number of iterations is not fixed (it depends on n)
Good if there are multiple L3 caches
Good for NUMA reasons (it can be parallelized analogously to the i_c loop above)
Outline
GotoBLAS approach
Opportunities for Parallelism
Many-threaded Results
Intel Xeon Phi: Many Threads
60 cores, 4 threads per core
More than two threads per core must be used to utilize the FPU
We do not block for the L1 cache: with 4 threads sharing an L1 cache, it is difficult to amortize the cost of updating C
Instead, we treat part of the L2 cache as a virtual L1
Each core has its own L2 cache
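With 240 hardware threads, no single loop offers enough parallelism on its own, so levels must be combined. A hedged sketch of nesting two of the loops above with OpenMP; the 60 x 4 split and the choice of loops here are illustrative assumptions, not the paper's exact assignment, and the packed layout matches the earlier inner_kernel sketch:

#include <stddef.h>
#include <omp.h>

enum { MR = 8, NR = 8 };  /* placeholder register-block sizes */

/* Two nested levels of parallelism over the packed inner kernel:
   e.g. the j_r loop across cores and the i_r loop among the hardware
   threads of each core. Assumes mc % MR == 0 and nc % NR == 0. */
void inner_kernel_nested(size_t mc, size_t nc, size_t kc,
                         const double *Ap, const double *Bp,
                         double *C, size_t ldc) {
    omp_set_max_active_levels(2);                /* allow nested parallelism */
    #pragma omp parallel for num_threads(60)     /* j_r loop: one team per core */
    for (size_t jr = 0; jr < nc; jr += NR) {
        #pragma omp parallel for num_threads(4)  /* i_r loop: threads within a core */
        for (size_t ir = 0; ir < mc; ir += MR) {
            const double *a = Ap + (ir / MR) * (MR * kc);
            const double *b = Bp + (jr / NR) * (kc * NR);
            double *c = C + ir + jr * ldc;
            for (size_t p = 0; p < kc; p++)      /* micro-kernel stand-in */
                for (size_t j = 0; j < NR; j++)
                    for (size_t i = 0; i < MR; i++)
                        c[i + j * ldc] += a[p * MR + i] * b[p * NR + j];
        }
    }
}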
[Result figures: performance on the Intel Xeon Phi]
IBM Blue Gene/Q: (Not Quite as) Many Threads
16 cores, 4 threads per core
More than two threads per core must be used to utilize the FPU
We do not block for the L1 cache: with 4 threads sharing an L1 cache, it is difficult to amortize the cost of updating C
Instead, we treat part of the L2 cache as a virtual L1
Single large L2 cache shared by all cores
[Result figures: performance on the IBM Blue Gene/Q]
Thank You
Questions?
Source code available at: code.google.com/p/blis/