Published by Linda Lucas. Modified over 9 years ago.
1
PLASMA (Parallel Linear Algebra for Scalable Multicore Architectures)
Presented by the Innovative Computing Laboratory, University of Tennessee, Knoxville
2
Dongarra_PLASMA_SC07 Dongarra_KOJAK_SC07
Why Multicore?
The ILP Wall
  Single core: ILP (instruction-level parallelism) is expensive and does not generate much concurrency.
  Multicore: TLP (thread-level parallelism) provides higher concurrency.
The Memory Wall
  Single core: latency is completely exposed.
  Multicore: latency can be hidden using multithreading.
The Power Wall
  Single core: power consumption grows with MHz^3.
  Multicore: power consumption grows linearly with the number of transistors.
3
Programming for Multicores
Parallel software for multicores should have two characteristics:
  Fine granularity:
    a high degree of parallelism is needed;
    cores are (and probably will be) associated with relatively small local memories.
  Asynchronicity:
    a high degree of parallelism makes synchronization a bigger bottleneck;
    asynchronous execution helps hide latency.
4
Developing Parallel Algorithms
[Figure: two software stacks. Labels: LAPACK, threaded BLAS, sequential BLAS, PThreads, OpenMP, parallelism.]
5
Developing Parallel Algorithms: why?
BLAS2 operations cannot be efficiently parallelized because they are bandwidth bound.
  strict synchronizations
  poor parallelism
  poor scalability
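The bandwidth-bound claim can be checked with rough arithmetic. The sketch below counts floating-point operations per double moved for a matrix-vector product (DGEMV, a BLAS2 operation) and a matrix-matrix product (DGEMM, a BLAS3 operation). It assumes every operand is read or written exactly once and ignores caches; the function names are illustrative and do not belong to any library.

```python
def gemv_intensity(n):
    """Flops per double moved for y = A*x + y with an n x n matrix A."""
    flops = 2 * n * n            # one multiply and one add per matrix entry
    traffic = n * n + 3 * n      # read A, read x, read y, write y
    return flops / traffic


def gemm_intensity(n):
    """Flops per double moved for C = A*B + C with n x n matrices."""
    flops = 2 * n ** 3           # n multiply-adds for each of the n*n entries of C
    traffic = 4 * n * n          # read A, B and C, write C
    return flops / traffic


for n in (100, 1000):
    print(n, round(gemv_intensity(n), 2), round(gemm_intensity(n), 2))
```

Under these assumptions the DGEMV ratio stays below 2 flops per double for any n, while the DGEMM ratio is n/2, so only the BLAS3 operation gives each core enough work per memory access to scale.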
6
Tiled Cholesky Factorization
In some cases it is possible to reuse the LAPACK algorithm by breaking its elementary operations into tiles.
Cholesky (the tile operands appeared as graphics on the slide and are not reproduced):
  do DPOTF2 on
  for all
    do DTRSM on
  end
  for all
    do DGEMM on
  end
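Because the tile operands are missing from the transcript, the following is only a sketch of one common way to arrange these three steps: a right-looking tiled Cholesky on the lower triangle. The helpers potf2, trsm and gemm are pure-Python stand-ins named after the kernels on the slide, not the BLAS/LAPACK routines themselves, and the loop bounds are an assumption rather than a reconstruction of the slide.

```python
import math


def potf2(a):
    """Unblocked Cholesky of one tile: overwrite a with lower-triangular L, a = L * L^T."""
    n = len(a)
    for j in range(n):
        a[j][j] = math.sqrt(a[j][j] - sum(a[j][p] ** 2 for p in range(j)))
        for i in range(j + 1, n):
            a[i][j] = (a[i][j] - sum(a[i][p] * a[j][p] for p in range(j))) / a[j][j]
        for i in range(j):
            a[i][j] = 0.0  # clear the strictly upper part


def trsm(l, b):
    """Overwrite b with b * L^-T, where l holds a factored diagonal tile."""
    n = len(l)
    for row in b:
        for j in range(n):
            row[j] = (row[j] - sum(row[p] * l[j][p] for p in range(j))) / l[j][j]


def gemm(a, b, c):
    """Overwrite c with c - a * b^T."""
    n = len(c)
    for i in range(n):
        for j in range(n):
            c[i][j] -= sum(a[i][p] * b[j][p] for p in range(n))


def tiled_cholesky(t):
    """Factor in place; t[i][j] is tile (i, j) and only tiles with i >= j are used."""
    nt = len(t)
    for k in range(nt):
        potf2(t[k][k])                        # factor the diagonal tile
        for i in range(k + 1, nt):
            trsm(t[k][k], t[i][k])            # solve for the tiles below it
        for i in range(k + 1, nt):
            for j in range(k + 1, i + 1):
                gemm(t[i][k], t[j][k], t[i][j])  # update the trailing tiles
```

Each call touches at most three small tiles, which is what makes the individual operations fine-grained tasks. A production code would normally use a symmetric rank-k update such as DSYRK for the diagonal tiles; the sketch uses the general update everywhere for brevity.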
7
Tiled LU Factorization
In many cases different algorithms are needed, which must either be invented or be found in the literature.
LU and QR (the kernel definitions appeared as graphics on the slide and are not reproduced):
  DTSTRF:
  DGETRF:
  DGESSM:
  DSSSSM:
9
Tiled LU Factorization
  k=1: DGETRF
  k=1, j=2: DGESSM
  k=1, j=3: DGESSM
  k=1, i=2: DTSTRF
  k=1, i=2, j=2: DSSSSM
  k=1, i=2, j=3: DSSSSM
  k=1, i=3: DTSTRF
  k=1, i=3, j=2: DSSSSM
  k=1, i=3, j=3: DSSSSM
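The sequence of kernel calls listed for k=1 follows a regular loop nest. The generator below is a sketch that reproduces that order for a matrix of nt x nt tiles; the function name and the tuple format are illustrative, and the extension to k > 1 assumes the same pattern repeats on the trailing tiles.

```python
def tiled_lu_tasks(nt):
    """Yield (kernel, k, i, j) in the sequential order of a tiled LU on nt x nt tiles.

    Indices are 1-based to match the slide; indices a kernel does not use are None.
    """
    for k in range(1, nt + 1):
        yield ("DGETRF", k, None, None)       # diagonal tile of step k
        for j in range(k + 1, nt + 1):
            yield ("DGESSM", k, None, j)      # tiles to the right of the diagonal
        for i in range(k + 1, nt + 1):
            yield ("DTSTRF", k, i, None)      # tile (i, k) below the diagonal
            for j in range(k + 1, nt + 1):
                yield ("DSSSSM", k, i, j)     # trailing tile (i, j)


for task in tiled_lu_tasks(3):
    print(task)
```

For nt = 3 the first nine tuples correspond line by line to the k=1 list above.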
10
Block Data Layout
Fine granularity may require novel data formats to overcome the limitations of BLAS on small chunks of data.
[Figure: column-major layout compared with block data layout.]
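To make the two layouts concrete, here is a sketch of repacking a column-major array so that each tile occupies a contiguous run of memory. The ordering of tiles and the column-major storage inside each tile are assumptions chosen for illustration; the slide does not specify them, and the function names are hypothetical.

```python
def column_major_to_tiles(a, m, n, nb):
    """Repack an m x n column-major array into contiguous nb x nb tiles.

    a is a flat list with element (i, j) stored at a[i + j * m].
    Assumes nb divides both m and n. Tiles are stored one after another,
    tile columns first, and each tile is itself column-major.
    """
    mt, nt = m // nb, n // nb
    out = []
    for tj in range(nt):
        for ti in range(mt):
            for j in range(nb):
                col = (tj * nb + j) * m + ti * nb
                out.extend(a[col:col + nb])   # one contiguous column of the tile
    return out


def tile_index(i, j, m, nb):
    """Offset of element (i, j) in the array returned by column_major_to_tiles."""
    mt = m // nb
    ti, ii = divmod(i, nb)
    tj, jj = divmod(j, nb)
    return (tj * mt + ti) * nb * nb + jj * nb + ii
```

In the column-major original, consecutive columns of a tile are m elements apart; after repacking they are nb elements apart, so a kernel working on one tile reads a single compact block.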
11
Graph Driven Asynchronous Execution
The whole factorization can be represented as a DAG:
  nodes: tasks that operate on tiles
  edges: dependencies among tasks
Tasks can be scheduled asynchronously and in any order as long as dependencies are not violated.
[Figure: DAG with nodes labeled DTSTRF, DGETRF, DGESSM, DSSSSM.]
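A dependency-driven executor of this kind can be sketched with Python's standard thread pool. The run_dag function below is a hypothetical illustration, not PLASMA's scheduler: it launches a task as soon as all of its prerequisites have completed. The example graph is an assumed dependency set for a 2 x 2 tile LU, written by hand for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait


def run_dag(deps, run, workers=4):
    """Run every task in deps once all of its prerequisites have finished.

    deps maps each task to the set of tasks it depends on.
    run is called with the task name; calls may execute concurrently.
    Returns the tasks in the order in which they completed.
    """
    remaining = {t: set(d) for t, d in deps.items()}
    finished = []
    running = {}                                   # future -> task
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while remaining or running:
            ready = [t for t, d in remaining.items() if not d]
            for t in ready:                        # launch everything that is ready
                del remaining[t]
                running[pool.submit(run, t)] = t
            if not running:
                raise ValueError("dependency cycle detected")
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in done:
                fut.result()                       # re-raise any task exception
                t = running.pop(fut)
                finished.append(t)
                for d in remaining.values():
                    d.discard(t)                   # release dependents
    return finished


# Assumed dependencies for a 2 x 2 tile LU (illustrative only).
deps = {
    "DGETRF(1)": set(),
    "DGESSM(1,2)": {"DGETRF(1)"},
    "DTSTRF(1,2)": {"DGETRF(1)"},
    "DSSSSM(1,2,2)": {"DGESSM(1,2)", "DTSTRF(1,2)"},
    "DGETRF(2)": {"DSSSSM(1,2,2)"},
}
order = run_dag(deps, run=lambda task: None)
print(order)
```

There is no barrier between steps: the two tasks that depend only on DGETRF(1) may finish in either order, and the only guarantee is that no task starts before its prerequisites are done.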
12
Graph Driven Asynchronous Execution
Priorities:
A critical path can be defined as the shortest path that connects all the nodes with the highest number of outgoing edges.
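One plausible reading of this priority rule is that, among the tasks that are ready to run, the scheduler should prefer the one with the most outgoing edges, since finishing it releases the most successors. The sketch below applies that heuristic to a small hypothetical graph with abstract task names; it is an interpretation for illustration, not the scheduler described in the presentation.

```python
import heapq


def prioritized_order(deps):
    """Sequential schedule that always picks the ready task with the most dependents.

    deps maps each task to the set of tasks it depends on.
    Ties are broken by task name.
    """
    out_degree = {t: 0 for t in deps}
    for prereqs in deps.values():
        for p in prereqs:
            out_degree[p] += 1
    remaining = {t: set(d) for t, d in deps.items()}
    ready = [(-out_degree[t], t) for t, d in remaining.items() if not d]
    heapq.heapify(ready)
    order = []
    while ready:
        _, t = heapq.heappop(ready)
        order.append(t)
        del remaining[t]
        for u, d in remaining.items():
            if t in d:
                d.remove(t)
                if not d:                      # last prerequisite just finished
                    heapq.heappush(ready, (-out_degree[u], u))
    return order


# Hypothetical graph: "a" and "c" each have two outgoing edges.
deps = {
    "a": set(),
    "b": {"a"},
    "c": {"a"},
    "d": {"c"},
    "e": {"c"},
    "f": {"b", "d"},
}
print(prioritized_order(deps))
```

After "a" completes, both "b" and "c" are ready; the heuristic runs "c" first because two tasks wait on it, while only one waits on "b".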
13
Fork-Join vs Asynchronous
[Figure: execution traces over time for fork-join and asynchronous scheduling, with idle time marked.]
14
Performance: Cholesky
15
Performance: QR
16
Performance: LU
17
Contacts
http://icl.cs.utk.edu/~buttari
http://www-math.cudenver.edu/~langou
http://icl.cs.utk.edu/~kurzak
http://netlib.org/utk/people/JackDongarra
http://icl.cs.utk.edu