Anatomy of a High-Performance Many-Threaded Matrix Multiplication
Tyler M. Smith, Robert A. van de Geijn, Mikhail Smelyanskiy, Jeff Hammond, Field G. Van Zee

Introduction
- Shared memory parallelism for GEMM
- Many-threaded architectures require more sophisticated methods of parallelism
- Explore the opportunities for parallelism and explain which ones we will exploit
- Need finer-grained parallelism

Outline
-> GotoBLAS approach
   Opportunities for Parallelism
   Many-threaded Results

GotoBLAS Approach
The GEMM operation: C += A B, where C is m x n, A is m x k, and B is k x n.
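As a point of reference for the slides that follow, the GEMM update can be written as three plain nested loops. This is only a minimal sketch in pure Python; the function name `gemm_reference` and the sample matrices are illustrative assumptions, not code from the talk or from BLIS.

```python
def gemm_reference(A, B, C):
    """Update C in place with C += A B using plain nested loops.

    A is m x k, B is k x n, and C is m x n, all stored as lists of rows.
    """
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            acc = C[i][j]
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = acc
    return C


# Illustrative data (assumed): m = 2, k = 3, n = 2.
A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C = [[10.0, 10.0], [10.0, 10.0]]
result = gemm_reference(A, B, C)
```

The blocked and parallel variants discussed later compute the same update; they differ only in how the iteration space is ordered and divided.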

[Figures: a sequence of slides showing main memory, L3 cache, L2 cache, L1 cache, and registers, with the operands of C += A B partitioned step by step using the block sizes nc, kc, mc, nr, and mr.]
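The successive partitionings by nc, kc, mc, nr, and mr can be sketched as five loops around a small inner kernel. The version below is a simplified illustration only: the loop labels `jc`, `pc`, and `ic` are assumed names (the slides label only `jr` and `ir`), the default block sizes are arbitrary small values, and the packing of A and B into contiguous buffers that a real implementation would perform is omitted.

```python
def micro_kernel(A, B, C, i0, i1, j0, j1, p0, p1):
    """Update the small tile C[i0:i1][j0:j1] using columns p0..p1 of A and rows p0..p1 of B."""
    for i in range(i0, i1):
        for j in range(j0, j1):
            acc = C[i][j]
            for p in range(p0, p1):
                acc += A[i][p] * B[p][j]
            C[i][j] = acc


def gemm_blocked(A, B, C, nc=4, kc=3, mc=4, nr=2, mr=2):
    """C += A B with five blocking loops (block sizes here are illustrative)."""
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, nc):                  # column panels of B and C, width nc
        j_end = min(jc + nc, n)
        for pc in range(0, k, kc):              # slices of the k dimension, depth kc
            p_end = min(pc + kc, k)
            for ic in range(0, m, mc):          # row blocks of A and C, height mc
                i_end = min(ic + mc, m)
                for jr in range(jc, j_end, nr):      # micro-panels of B, width nr
                    for ir in range(ic, i_end, mr):  # micro-panels of A, height mr
                        micro_kernel(A, B, C,
                                     ir, min(ir + mr, i_end),
                                     jr, min(jr + nr, j_end),
                                     pc, p_end)
    return C


# Small check against a direct evaluation (sizes chosen so blocks do not divide evenly).
m, k, n = 5, 7, 6
A = [[float(i + p) for p in range(k)] for i in range(m)]
B = [[float(p - j) for j in range(n)] for p in range(k)]
C = gemm_blocked(A, B, [[0.0] * n for _ in range(m)])
expected = [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]
```

Each loop in this nest is a candidate for parallelization, which is what the following slides examine.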

Outline
   GotoBLAS approach
-> Opportunities for Parallelism
   Many-threaded Results

3 Loops to Parallelize in GotoBLAS

5 Opportunities for Parallelism

Multiple Levels of Parallelism (ir loop)
- All threads share a micro-panel of B
- Each thread has its own micro-panel of A
- Fixed number of iterations

Multiple Levels of Parallelism (jr loop)
- All threads share a block of A
- Each thread has its own micro-panel of B
- Fixed number of iterations
- Good if shared L2 cache
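Parallelizing the jr loop can be sketched as follows: every thread reads the same block of A but is handed its own micro-panels (column strips of width nr) of B and C, so no two threads write the same entry of C. The helper names, the round-robin assignment, and the sample data are assumptions made for illustration. Python threads are used only to show the partitioning; they do not give the speedup that native threads would.

```python
from concurrent.futures import ThreadPoolExecutor


def update_micro_panels(A_block, B_panel, C_block, jr_starts, nr):
    """One thread's share of the jr loop: all of A_block, only its own columns of B and C."""
    k = len(B_panel)
    n = len(B_panel[0])
    for jr in jr_starts:
        for j in range(jr, min(jr + nr, n)):
            for i in range(len(A_block)):
                C_block[i][j] += sum(A_block[i][p] * B_panel[p][j] for p in range(k))


def parallel_jr(A_block, B_panel, C_block, nr=2, n_threads=2):
    """Distribute the jr iterations round-robin over n_threads workers."""
    n = len(B_panel[0])
    jr_starts = list(range(0, n, nr))
    shares = [jr_starts[t::n_threads] for t in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(update_micro_panels, A_block, B_panel, C_block, s, nr)
                   for s in shares]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return C_block


A_block = [[1.0, 2.0], [3.0, 4.0]]
B_panel = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 2.0]]
C_block = [[0.0] * 4 for _ in range(2)]
parallel_jr(A_block, B_panel, C_block, nr=2, n_threads=2)
```

Because the threads' column ranges are disjoint, no locking is needed for the updates to C in this sketch.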

Multiple Levels of Parallelism
- All threads share a panel of B
- Each thread has its own block of A
- Number of iterations is not fixed
- Good if multiple L2 caches
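When each thread owns its own block of A, the number of loop iterations depends on the row dimension m of the problem, so the work has to be divided at run time. The sketch below shows one possible division, assuming the loop steps through m in blocks of height mc and that each thread's share should be a whole number of mr-row micro-panels; the function names and this rounding policy are illustrative assumptions, not taken from the talk.

```python
def ic_iterations(m, mc):
    """Number of iterations of a loop that walks m rows in blocks of mc rows."""
    return -(-m // mc)  # ceiling division


def split_rows(m, n_threads, mr):
    """Split rows [0, m) into contiguous per-thread ranges.

    Every range except possibly the last non-empty one is a multiple of mr rows.
    """
    n_panels = -(-m // mr)                 # number of mr-row micro-panels
    base, extra = divmod(n_panels, n_threads)
    ranges, start = [], 0
    for t in range(n_threads):
        count = base + (1 if t < extra else 0)
        end = min(start + count * mr, m)
        ranges.append((start, end))
        start = end
    return ranges


small = ic_iterations(100, 32)
large = ic_iterations(1000, 32)
ranges = split_rows(100, 4, 8)
```

With these sample values the iteration count grows with m, and the four threads receive row ranges of 32, 24, 24, and 20 rows.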

Multiple Levels of Parallelism
- Each iteration updates entire C
- Iterations of the loop are not independent
- Requires mutex when updating C
- Or a reduction
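The reduction option can be sketched like this: each worker multiplies its own slice of the k dimension into a private partial result, and the partial results are added into C afterwards, so the workers never write to C concurrently. The function names, the slice depth `kc=2`, and the serial final sum are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_product(A, B, p0, p1):
    """Private partial result: the product restricted to indices p0..p1 of the k dimension."""
    m, n = len(A), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(p0, p1)) for j in range(n)]
            for i in range(m)]


def gemm_k_reduction(A, B, C, kc=2, n_threads=2):
    """C += A B where slices of the k dimension are computed independently, then reduced."""
    k = len(B)
    slices = [(p0, min(p0 + kc, k)) for p0 in range(0, k, kc)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(lambda s: partial_product(A, B, s[0], s[1]), slices))
    # Reduction step: performed serially here, after all workers have finished.
    for part in partials:
        for i in range(len(C)):
            for j in range(len(C[0])):
                C[i][j] += part[i][j]
    return C


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C = gemm_k_reduction(A, B, [[10.0, 10.0], [10.0, 10.0]])
```

The price of avoiding a mutex this way is one extra m x n buffer per slice plus the final summation.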

Multiple Levels of Parallelism
- All threads share matrix A
- Each thread has its own panel of B
- Number of iterations is not fixed
- Good if multiple L3 caches
- Good for NUMA reasons
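Combining several of these levels means the total thread count is the product of the number of ways each loop is split. One way to picture this is to map a flat thread id to one index per parallelized loop, as sketched below. The loop labels `jc` and `ic`, the function name, and the 2 x 3 x 4 x 1 split are hypothetical choices for illustration, not a configuration from the talk.

```python
def thread_coords(tid, ways):
    """Map a flat thread id to one index per loop.

    `ways` lists loops from outermost to innermost with the number of
    threads each loop is split among; the innermost loop varies fastest.
    """
    coords = {}
    for loop in reversed(list(ways)):
        tid, coords[loop] = divmod(tid, ways[loop])
    return {loop: coords[loop] for loop in ways}


# Hypothetical split: 2 x 3 x 4 x 1 = 24 threads in total.
ways = {"jc": 2, "ic": 3, "jr": 4, "ir": 1}
total_threads = 1
for w in ways.values():
    total_threads *= w

first = thread_coords(0, ways)
fifth = thread_coords(5, ways)
last = thread_coords(total_threads - 1, ways)
all_coords = {tuple(thread_coords(t, ways).values()) for t in range(total_threads)}
```

Every thread id maps to a distinct combination of loop indices, so each thread can derive its share of every parallelized loop from its id alone.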

Outline
   GotoBLAS approach
   Opportunities for Parallelism
-> Many-threaded Results

Intel Xeon Phi
Many Threads
- 60 cores, 4 threads per core
- Need to use > 2 threads per core to utilize FPU
We do not block for the L1 cache
- Difficult to amortize the cost of updating C with 4 threads sharing an L1 cache
- We consider part of the L2 cache as a virtual L1
Each core has its own L2 cache

IBM Blue Gene/Q
(Not quite as) Many Threads
- 16 cores, 4 threads per core
- Need to use > 2 threads per core to utilize FPU
We do not block for the L1 cache
- Difficult to amortize the cost of updating C with 4 threads sharing an L1 cache
- We consider part of the L2 cache as a virtual L1
Single large, shared L2 cache

Thank You
Questions?
Source code available at: code.google.com/p/blis/
