Published by Dawson Plume. Modified over 9 years ago.
1
Anatomy of a High-Performance Many-Threaded Matrix Multiplication
Tyler M. Smith, Robert A. van de Geijn, Mikhail Smelyanskiy, Jeff Hammond, Field G. Van Zee
2
Introduction
Shared-memory parallelism for GEMM
Many-threaded architectures require more sophisticated methods of parallelism
Explore the opportunities for parallelism to explain which ones we will exploit
Need finer-grained parallelism
3
Outline
GotoBLAS approach
Opportunities for Parallelism
Many-threaded Results
4
GotoBLAS Approach
The GEMM operation: C += A B, where C is m × n, A is m × k, and B is k × n.
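As a baseline for the slides that follow, the GEMM update can be sketched as a plain triple loop. This is a minimal illustration, not the GotoBLAS or BLIS code; the function name `gemm_reference` and the list-of-lists matrix representation are assumptions made for this example, and the matrices are assumed to be non-empty.

```python
def gemm_reference(A, B, C):
    """Update C in place with C += A B using plain triple loops.

    A is m x k, B is k x n, C is m x n, all stored as lists of rows.
    Hypothetical helper for illustration only.
    """
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] += acc  # accumulate into C rather than overwrite it
    return C
```

Every entry of C is updated independently here, which is what later gives several different loops to choose from when distributing work across threads.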
5
[Figure: the GEMM update (+=) drawn against the memory hierarchy: main memory, L3 cache, L2 cache, L1 cache, registers.]
6
[Figure: memory hierarchy (main memory, L3 cache, L2 cache, L1 cache, registers); partitioning with block size nc.]
7
[Figure: memory hierarchy (main memory, L3 cache, L2 cache, L1 cache, registers); partitioning with block size kc.]
8
[Figure: memory hierarchy (main memory, L3 cache, L2 cache, L1 cache, registers); partitioning with block size mc.]
9
[Figure: memory hierarchy (main memory, L3 cache, L2 cache, L1 cache, registers); partitioning with block size nr.]
10
[Figure: memory hierarchy (main memory, L3 cache, L2 cache, L1 cache, registers); partitioning with block size mr.]
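The sequence of figures above introduces the block sizes nc, kc, mc, nr and mr one at a time. A rough sketch of the resulting loop nest is shown below, assuming the usual ordering in which nc and nr partition the n dimension, kc partitions the k dimension, and mc and mr partition the m dimension. The loop index names (`jc`, `pc`, `ic`, `jr`, `ir`), the function name, and the tiny default block sizes are illustrative choices, not tuned values from the talk. Packing of A and B into contiguous buffers and a hand-optimized micro-kernel, which real implementations rely on, are omitted.

```python
def gemm_blocked(A, B, C, nc=4, kc=3, mc=4, nr=2, mr=2):
    """C += A B with five blocked loops around a small micro-kernel.

    Block sizes are hypothetical toy values; they only need to be positive.
    """
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, nc):                  # panel of B: nc columns
        for pc in range(0, k, kc):              # kc rows of that panel
            for ic in range(0, m, mc):          # block of A: mc x kc
                for jr in range(jc, min(jc + nc, n), nr):      # micro-panel of B
                    for ir in range(ic, min(ic + mc, m), mr):  # micro-panel of A
                        # Micro-kernel: update one (at most) mr x nr block of C.
                        for i in range(ir, min(ir + mr, ic + mc, m)):
                            for j in range(jr, min(jr + nr, jc + nc, n)):
                                acc = 0.0
                                for p in range(pc, min(pc + kc, k)):
                                    acc += A[i][p] * B[p][j]
                                C[i][j] += acc
    return C
```

The `min(...)` bounds handle matrix dimensions that are not multiples of the block sizes, so the sketch produces the same result as an unblocked triple loop for any positive block sizes.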
11
Outline
GotoBLAS approach
Opportunities for Parallelism
Many-threaded Results
12
3 Loops to Parallelize in GotoBLAS
[Figure: GEMM update diagram (+=).]
13
5 Opportunities for Parallelism
[Figure: GEMM update diagram (+=).]
14
Multiple Levels of Parallelism
[Figure: the ir loop.]
All threads share a micro-panel of B
Each thread has its own micro-panel of A
Fixed number of iterations
15
Multiple Levels of Parallelism
[Figure: the jr loop.]
All threads share a block of A
Each thread has its own micro-panel of B
Fixed number of iterations
Good if the L2 cache is shared
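The "fixed number of iterations" point matters because the trip count of such a loop is set by the block sizes rather than by the problem size, which caps how many threads it can keep busy. A small sketch of dividing a fixed trip count among threads follows. The function name `partition_iterations` and the example values (a hypothetical nc of 240 and nr of 8, giving 30 iterations) are assumptions for illustration, not figures from the talk.

```python
def partition_iterations(num_iters, num_threads):
    """Split range(num_iters) into contiguous half-open (start, end) ranges.

    Returns one range per thread; earlier threads receive one extra
    iteration when the count does not divide evenly.
    """
    base, extra = divmod(num_iters, num_threads)
    ranges = []
    start = 0
    for t in range(num_threads):
        count = base + (1 if t < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


# Hypothetical block sizes: a loop over nc = 240 columns in steps of nr = 8.
nc, nr = 240, 8
work = partition_iterations(nc // nr, 4)
```

With these assumed values, four threads receive 8, 8, 7 and 7 iterations. Asking for more threads than there are iterations leaves some threads with an empty range, which is one reason a single loop may not provide enough parallelism on a many-threaded machine.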
16
Multiple Levels of Parallelism
All threads share a panel of B
Each thread has its own block of A
Number of iterations is not fixed
Good if there are multiple L2 caches
17
Multiple Levels of Parallelism
Each iteration updates the entire C
Iterations of the loop are not independent
Requires a mutex when updating C, or a reduction
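To make the dependence concrete, the sketch below splits the k dimension among threads. Every thread contributes to every entry of C, so each one accumulates into a private copy and then adds that copy into the shared C while holding a lock, which is a reduction whose final step is guarded by a mutex. The function name `gemm_parallel_k` and the use of Python's `threading` module are choices made for this illustration only. Python threads show the synchronization pattern but are not a route to real speedup here.

```python
import threading


def gemm_parallel_k(A, B, C, num_threads=2):
    """C += A B with the k dimension divided among threads."""
    m, k, n = len(A), len(B), len(B[0])
    lock = threading.Lock()
    chunk = max(1, (k + num_threads - 1) // num_threads)

    def worker(p_start, p_end):
        # Private partial result: no other thread touches this buffer.
        partial = [[0.0] * n for _ in range(m)]
        for i in range(m):
            for j in range(n):
                for p in range(p_start, p_end):
                    partial[i][j] += A[i][p] * B[p][j]
        # All threads update the same entries of C, so serialize the merge.
        with lock:
            for i in range(m):
                for j in range(n):
                    C[i][j] += partial[i][j]

    threads = [
        threading.Thread(target=worker, args=(s, min(s + chunk, k)))
        for s in range(0, k, chunk)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C
```

The private buffers cost extra memory proportional to the size of C per thread, and the merge step is serialized, which illustrates why this loop is a less attractive target than the loops whose iterations write disjoint parts of C.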
19
Multiple Levels of Parallelism
All threads share matrix A
Each thread has its own panel of B
Number of iterations is not fixed
Good if there are multiple L3 caches
Good for NUMA reasons
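By contrast, giving each thread its own panel of B means each thread also owns the matching columns of C, so no locking is needed. A minimal sketch under that reading is shown below. The function name `gemm_parallel_n` and the use of `concurrent.futures.ThreadPoolExecutor` are assumptions for this example rather than anything taken from the talk.

```python
from concurrent.futures import ThreadPoolExecutor


def gemm_parallel_n(A, B, C, num_threads=2):
    """C += A B with column panels of B (and C) divided among threads."""
    m, k, n = len(A), len(B), len(B[0])
    chunk = max(1, (n + num_threads - 1) // num_threads)

    def update_panel(j_start):
        j_end = min(j_start + chunk, n)
        # This task reads all of A but writes only columns j_start..j_end-1 of C.
        for i in range(m):
            for j in range(j_start, j_end):
                acc = 0.0
                for p in range(k):
                    acc += A[i][p] * B[p][j]
                C[i][j] += acc

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # list() drains the iterator so any exception in a task is raised here.
        list(pool.map(update_panel, range(0, n, chunk)))
    return C
```

Because the number of panels depends on n, the amount of available parallelism grows with the problem size, which matches the slide's note that the number of iterations is not fixed.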
20
Outline
GotoBLAS approach
Opportunities for Parallelism
Many-threaded Results
21
Intel Xeon Phi
Many threads: 60 cores, 4 threads per core
Need to use > 2 threads per core to utilize the FPU
We do not block for the L1 cache
Difficult to amortize the cost of updating C with 4 threads sharing an L1 cache
We consider part of the L2 cache as a virtual L1
Each core has its own L2 cache
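One way to read these points together with the earlier slides is to use two levels of parallelism at once: one loop split across the 60 cores (each with its own L2 cache) and another split across the 4 hardware threads that share a core. The sketch below only shows the index arithmetic for such a two-level split. The function name `thread_coordinates`, the parameter names `ic_ways` and `jr_ways`, and the specific 60 × 4 assignment are assumptions for illustration. The slides' text does not state which loops were actually parallelized.

```python
def thread_coordinates(thread_id, ic_ways=60, jr_ways=4):
    """Map a flat thread id to (index in the outer split, index in the inner split).

    ic_ways and jr_ways are hypothetical names for the number of ways each
    of the two loops is divided; their product is the total thread count.
    """
    if not 0 <= thread_id < ic_ways * jr_ways:
        raise ValueError("thread_id out of range")
    return divmod(thread_id, jr_ways)


total_threads = 60 * 4  # 240 hardware threads on the machine described above
```

Under this assumed mapping, threads 0 to 3 share the first outer-loop slice and split its inner loop four ways, threads 4 to 7 share the second slice, and so on up to thread 239.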
26
IBM Blue Gene/Q
(Not quite as) many threads: 16 cores, 4 threads per core
Need to use > 2 threads per core to utilize the FPU
We do not block for the L1 cache
Difficult to amortize the cost of updating C with 4 threads sharing an L1 cache
We consider part of the L2 cache as a virtual L1
Single large, shared L2 cache
33
Thank You
Questions?
Source code available at: code.google.com/p/blis/