1
THE UNIVERSITY OF TEXAS AT AUSTIN. Programming Dense Matrix Computations Using Distributed and Off-Chip Shared-Memory on Many-Core Architectures. Ernie Chan. Vienna talk, November 3, 2010.
2
How to Program SCC? 48 cores in a 6×4 mesh of tiles with 2 cores per tile; 4 DDR3 memory controllers. [Figure: SCC die layout – a 6×4 mesh of routers and tiles, the memory controllers, and the system I/F; each tile contains Core 0 and Core 1, their L2 caches (L2$0, L2$1), a router, and the message passing buffer (MPB)]
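Before any communication, each core identifies itself within the RCCE runtime. The sketch below is a minimal illustration, assuming the standard RCCE entry points (RCCE_init, RCCE_ue, RCCE_num_ues, RCCE_finalize) and the RCCE_APP entry-point convention; none of this appears in the talk itself.

  #include <stdio.h>
  #include "RCCE.h"

  int RCCE_APP(int argc, char **argv)
  {
      RCCE_init(&argc, &argv);      /* bring up the RCCE runtime on this core       */

      int me   = RCCE_ue();         /* this core's rank (unit of execution)         */
      int npes = RCCE_num_ues();    /* number of cores participating in the program */

      printf("core %d of %d is up\n", me, npes);

      RCCE_finalize();              /* shut the runtime down cleanly                */
      return 0;
  }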
3
Outline: How to Program SCC? – Elemental – Collective Communication – Off-Chip Shared-Memory – Conclusion
4
Elemental: a new, modern distributed-memory dense linear algebra library – Replacement for PLAPACK and ScaLAPACK – Object-oriented data structures for matrices – Coded in C++ – Torus-wrap/elemental mapping of matrices to a two-dimensional process grid (sketched below) – Implemented entirely using bulk synchronous communication
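As a small illustration of the torus-wrap/elemental mapping, the owner of a matrix element depends only on its indices and the process grid shape. The sketch below assumes an r × c grid with column-major process ranks (matching the 2 × 3 grid of processes 0–5 pictured on the next slides); it is illustrative, not Elemental's actual code, and owner_of_element is a made-up helper.

  /* Elemental-style (element-wise cyclic) mapping of matrix entry (i, j)
     onto an r x c process grid with column-major ranks.                  */
  int owner_of_element(int i, int j, int r, int c)
  {
      int grid_row = i % r;              /* process row owning row i        */
      int grid_col = j % c;              /* process column owning column j  */
      return grid_row + grid_col * r;    /* column-major rank in the grid   */
  }

  /* Example: on a 2 x 3 grid (ranks 0..5), entry (4, 7) lives on
     process (4 % 2) + (7 % 3) * 2 = 0 + 1 * 2 = 2.                        */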
5
Elemental [Figure: a matrix distributed element-wise over a 2 × 3 process grid of processes 0–5]
6
Elemental [Figure: a matrix distributed element-wise over a 2 × 3 process grid of processes 0–5]
7
Elemental [Figure: a matrix distributed element-wise over a 2 × 3 process grid of processes 0–5]
8
Elemental: Redistributing the matrix over a process grid – Collective communication
9
Outline: How to Program SCC? – Elemental – Collective Communication – Off-Chip Shared-Memory – Conclusion
10
Collective Communication: RCCE Message Passing API – Blocking send and receive: int RCCE_send( char *buf, size_t num, int dest ); int RCCE_recv( char *buf, size_t num, int src ); – Potential for deadlock. [Figure: cores 0–5 arranged in a communication cycle]
11
Collective Communication: Avoiding Deadlock – Even number of cores in the cycle. [Figure: cores 0–5 in a cycle, shown in two stages]
12
Collective Communication: Avoiding Deadlock – Odd number of cores in the cycle. [Figure: cores 0–4 in a cycle, shown in three stages]
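One way to break such a cycle with only blocking calls is to schedule sends and receives by rank parity: even-ranked cores send first, odd-ranked cores receive first. The sketch below illustrates that idea on top of the RCCE_send/RCCE_recv signatures above; it is not necessarily the ordering RCCE_comm itself uses, and ring_shift and its buffers are made up for illustration.

  #include <stddef.h>
  #include "RCCE.h"

  /* Shift 'num' bytes one position around a ring of 'npes' cores without
     deadlock, using only blocking RCCE_send/RCCE_recv.  Even ranks send
     first, odd ranks receive first, so no cycle of cores waits on itself;
     this also resolves when npes is odd.                                  */
  void ring_shift(char *outbuf, char *inbuf, size_t num, int me, int npes)
  {
      int next = (me + 1) % npes;            /* core we send to      */
      int prev = (me - 1 + npes) % npes;     /* core we receive from */

      if (me % 2 == 0) {
          RCCE_send(outbuf, num, next);      /* even ranks: send, then receive */
          RCCE_recv(inbuf,  num, prev);
      } else {
          RCCE_recv(inbuf,  num, prev);      /* odd ranks: receive, then send  */
          RCCE_send(outbuf, num, next);
      }
  }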
13
Collective Communication: Broadcast – int RCCE_bcast( char *buf, size_t num, int root, RCCE_COMM comm ); [Figure: buffer contents before the broadcast]
14
Collective Communication: Broadcast – int RCCE_bcast( char *buf, size_t num, int root, RCCE_COMM comm ); [Figure: buffer contents after the broadcast]
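A usage sketch for RCCE_bcast, broadcasting from core 0. It assumes the global communicator RCCE_COMM_WORLD provided by RCCE (not shown in the talk); the array name and length are illustrative.

  #include "RCCE.h"

  void broadcast_example(void)
  {
      double x[128];                         /* payload; meaningful only on the root before the call */

      if (RCCE_ue() == 0)
          for (int i = 0; i < 128; i++)
              x[i] = (double) i;             /* root fills the payload */

      /* num is a byte count for the char buffer, hence the sizeof scaling. */
      RCCE_bcast((char *) x, 128 * sizeof(double), 0, RCCE_COMM_WORLD);

      /* Afterwards every core in the communicator holds the root's copy of x. */
  }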
15
Collective Communication: Reduce – int RCCE_reduce( char *inbuf, char *outbuf, int num, int type, int op, int root, RCCE_COMM comm ); [Figure: buffer contents before the reduce]
16
Collective Communication: Reduce – int RCCE_reduce( char *inbuf, char *outbuf, int num, int type, int op, int root, RCCE_COMM comm ); [Figure: after the reduce, the root holds the element-wise sums]
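A usage sketch for RCCE_reduce, summing one array per core onto core 0. It assumes RCCE's type and operation constants (RCCE_DOUBLE, RCCE_SUM), the global communicator RCCE_COMM_WORLD, and that num counts elements of the given type; all of these are assumptions to check against RCCE_comm, and the buffer names are illustrative.

  #include "RCCE.h"

  void reduce_example(void)
  {
      double partial[64];         /* each core's local contribution             */
      double total[64];           /* meaningful only on the root after the call */

      for (int i = 0; i < 64; i++)
          partial[i] = (double) RCCE_ue() + i;   /* dummy data so the sum is well defined */

      /* Element-wise sum of every core's partial[] delivered to core 0.
         num is treated here as an element count of RCCE_DOUBLE (assumed). */
      RCCE_reduce((char *) partial, (char *) total, 64, RCCE_DOUBLE, RCCE_SUM,
                  0, RCCE_COMM_WORLD);
  }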
17
Collective Communication: Gather – int RCCE_gather( char *inbuf, char *outbuf, size_t num, int root, RCCE_COMM comm ); [Figure: buffer contents before the gather]
18
Collective Communication: Gather – int RCCE_gather( char *inbuf, char *outbuf, size_t num, int root, RCCE_COMM comm ); [Figure: buffer contents after the gather]
19
Collective Communication: Scatter – int RCCE_scatter( char *inbuf, char *outbuf, size_t num, int root, RCCE_COMM comm ); [Figure: buffer contents before the scatter]
20
Collective Communication: Scatter – int RCCE_scatter( char *inbuf, char *outbuf, size_t num, int root, RCCE_COMM comm ); [Figure: buffer contents after the scatter]
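A round-trip sketch pairing the scatter and gather calls above: core 0 deals out one equal-sized chunk per core, each core works on its chunk, and core 0 collects the results back in rank order. It assumes num counts the bytes each core receives, that a non-root's inbuf is ignored by the scatter, and that RCCE_COMM_WORLD is available; CHUNK, the function name, and the buffer handling are illustrative.

  #include <stdlib.h>
  #include "RCCE.h"

  #define CHUNK 256   /* bytes per core (illustrative) */

  void scatter_work_gather(void)
  {
      int  me   = RCCE_ue();
      int  npes = RCCE_num_ues();
      char work[CHUNK];            /* this core's piece                      */
      char *all = NULL;            /* root-only buffer of npes * CHUNK bytes */

      if (me == 0)
          all = malloc((size_t) npes * CHUNK);

      /* Core i receives bytes [i*CHUNK, (i+1)*CHUNK) of the root's 'all' buffer. */
      RCCE_scatter(all, work, CHUNK, 0, RCCE_COMM_WORLD);

      /* ... each core processes its CHUNK bytes in 'work' ... */

      /* Root collects the processed chunks back into 'all', in rank order. */
      RCCE_gather(work, all, CHUNK, 0, RCCE_COMM_WORLD);

      if (me == 0)
          free(all);
  }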
21
Collective Communication: Allgather – int RCCE_allgather( char *inbuf, char *outbuf, size_t num, RCCE_COMM comm ); [Figure: buffer contents before the allgather]
22
Collective Communication: Allgather – int RCCE_allgather( char *inbuf, char *outbuf, size_t num, RCCE_COMM comm ); [Figure: buffer contents after the allgather]
23
Collective Communication: Reduce-Scatter – int RCCE_reduce_scatter( char *inbuf, char *outbuf, int *counts, int type, int op, RCCE_COMM comm ); [Figure: buffer contents before the reduce-scatter]
24
Collective Communication: Reduce-Scatter – int RCCE_reduce_scatter( char *inbuf, char *outbuf, int *counts, int type, int op, RCCE_COMM comm ); [Figure: after the reduce-scatter, each core holds one piece of the element-wise sums]
25
Collective Communication: Allreduce – int RCCE_allreduce( char *inbuf, char *outbuf, int num, int type, int op, RCCE_COMM comm ); [Figure: buffer contents before the allreduce]
26
Collective Communication: Allreduce – int RCCE_allreduce( char *inbuf, char *outbuf, int num, int type, int op, RCCE_COMM comm ); [Figure: after the allreduce, every core holds the element-wise sums]
27
Collective Communication: Alltoall – int RCCE_alltoall( char *inbuf, char *outbuf, size_t num, RCCE_COMM comm ); [Figure: before the alltoall, each core holds one block destined for each of cores 0–5]
28
Collective Communication: Alltoall – int RCCE_alltoall( char *inbuf, char *outbuf, size_t num, RCCE_COMM comm ); [Figure: after the alltoall, core i holds block i from every core]
29
Collective Communication: SendRecv – int RCCE_sendrecv( char *inbuf, size_t innum, int dest, char *outbuf, size_t outnum, int src, RCCE_COMM comm ); – A send call and a receive call combined into a single operation – Passing -1 as the rank for dest or src causes the corresponding communication to be skipped – Implemented as a permutation (see the sketch below)
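The combined call is convenient for shift patterns in which some cores sit at the end of a chain. The sketch below shifts data one position along a non-cyclic chain of cores, using -1 to skip the missing neighbour at either end; it assumes RCCE_COMM_WORLD, and the function name, buffers, and message size are illustrative.

  #include "RCCE.h"

  #define NBYTES 128   /* illustrative message size */

  void chain_shift(char *sendbuf, char *recvbuf)
  {
      int me   = RCCE_ue();
      int npes = RCCE_num_ues();

      int dest = (me + 1 < npes) ? me + 1 : -1;   /* last core sends to nobody       */
      int src  = (me > 0)        ? me - 1 : -1;   /* first core receives from nobody */

      /* Send 'sendbuf' to dest and receive into 'recvbuf' from src in one call;
         a rank of -1 skips that half of the exchange.                            */
      RCCE_sendrecv(sendbuf, NBYTES, dest, recvbuf, NBYTES, src, RCCE_COMM_WORLD);
  }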
30
Collective Communication: Minimum Spanning Tree Algorithm – Scatter. [Figure]
31
Collective Communication: Minimum Spanning Tree Algorithm – Scatter. [Figure]
32
Collective Communication: Minimum Spanning Tree Algorithm – Scatter. [Figure]
33
Collective Communication: Minimum Spanning Tree Algorithm – Scatter. [Figure]
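A sketch of the minimum-spanning-tree idea behind the scatter pictured above: the range of cores is halved at every step, the root forwards the far half's data to one core there, and both halves recurse. This is a simplified illustration (the root is assumed to be the first core of its range and the data is contiguous in rank order) built on blocking RCCE_send/RCCE_recv, not the actual RCCE_comm implementation; mst_scatter is a made-up name.

  #include <stddef.h>
  #include "RCCE.h"

  /* MST scatter of 'chunk' bytes per core from core 'first' to every core in
     [first, last].  On the root, 'buf' holds the chunks of all cores in the
     range in rank order; every other core needs room for the data it receives
     (its own chunk plus any it later forwards).  When the recursion bottoms
     out, each core's own chunk sits at the start of 'buf'.                     */
  void mst_scatter(char *buf, size_t chunk, int me, int first, int last)
  {
      if (first == last)
          return;                                  /* a single core: nothing left to split */

      int mid = (first + last) / 2;                /* halves [first, mid] and [mid+1, last] */
      size_t high_bytes = (size_t)(last - mid) * chunk;

      if (me == first)                             /* root forwards the high half's data    */
          RCCE_send(buf + (size_t)(mid + 1 - first) * chunk, high_bytes, mid + 1);
      else if (me == mid + 1)                      /* new root of the high half receives it */
          RCCE_recv(buf, high_bytes, first);

      if (me <= mid)
          mst_scatter(buf, chunk, me, first, mid);      /* continue within the low half  */
      else
          mst_scatter(buf, chunk, me, mid + 1, last);   /* continue within the high half */
  }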
34
Collective Communication: Cyclic (Bucket) Algorithm – Allgather. [Figure]
35
Collective Communication: Cyclic (Bucket) Algorithm – Allgather. [Figure]
36
Collective Communication: Cyclic (Bucket) Algorithm – Allgather. [Figure]
37
Collective Communication: Cyclic (Bucket) Algorithm – Allgather. [Figure]
38
Collective Communication: Cyclic (Bucket) Algorithm – Allgather. [Figure]
39
Collective Communication: Cyclic (Bucket) Algorithm – Allgather. [Figure]
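A sketch of the cyclic (bucket) algorithm pictured above: the cores form a logical ring and, in npes - 1 steps, each core forwards the block it most recently received to its right neighbour while receiving the next block from its left neighbour, so every block travels once around the ring. It reuses RCCE_sendrecv from earlier and assumes RCCE_COMM_WORLD; it is an illustration, not the RCCE_comm code.

  #include <stddef.h>
  #include "RCCE.h"

  /* Bucket (ring) allgather: core 'me' has already placed its 'chunk' bytes at
     offset me*chunk of 'buf'; after npes - 1 ring steps every core's 'buf'
     holds all contributions in rank order.                                     */
  void bucket_allgather(char *buf, size_t chunk, int me, int npes)
  {
      int right = (me + 1) % npes;             /* neighbour we send to      */
      int left  = (me - 1 + npes) % npes;      /* neighbour we receive from */

      for (int step = 0; step < npes - 1; step++) {
          /* Block forwarded this step: our own first, then whatever just arrived. */
          int send_block = (me - step + npes) % npes;
          int recv_block = (me - step - 1 + npes) % npes;

          RCCE_sendrecv(buf + (size_t) send_block * chunk, chunk, right,
                        buf + (size_t) recv_block * chunk, chunk, left,
                        RCCE_COMM_WORLD);
      }
  }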
40
Collective Communication [Figure]
41
Elemental [Figure]
42
Elemental [Figure]
43
Elemental [Figure]
44
Elemental [Figure]
45
Elemental [Figure]
46
Outline: How to Program SCC? – Elemental – Collective Communication – Off-Chip Shared-Memory – Conclusion
47
Off-Chip Shared-Memory: Distributed vs. Shared-Memory. [Figure: the SCC mesh of tiles, routers, memory controllers, and system I/F, contrasting a distributed-memory view with an off-chip shared-memory view]
48
Off-Chip Shared-Memory: SuperMatrix – Map dense matrix computation to a directed acyclic graph (DAG) – No matrix distribution – Store the DAG and matrix in off-chip shared-memory. [Figure: DAG for a 3 × 3 blocked Cholesky factorization with nodes CHOL 0, TRSM 1, TRSM 2, SYRK 3, GEMM 4, SYRK 5, CHOL 6, TRSM 7, SYRK 8, CHOL 9]
49
Off-Chip Shared-Memory: Non-cacheable vs. cacheable shared-memory – Non-cacheable: allows a simple programming interface, but poor performance – Cacheable: needs a software-managed cache coherency mechanism; executes on data stored in cache; interleaves distributed and shared-memory programming concepts
50
Off-Chip Shared-Memory [Figure]
51
SuperMatrix: Cholesky Factorization – Iteration 1. CHOL 0: A_{0,0} := Chol( A_{0,0} ). [Figure: DAG with node CHOL 0]
52
SuperMatrix: Cholesky Factorization – Iteration 1. CHOL 0: A_{0,0} := Chol( A_{0,0} ); TRSM 1: A_{1,0} := A_{1,0} A_{0,0}^{-T}; TRSM 2: A_{2,0} := A_{2,0} A_{0,0}^{-T}. [Figure: DAG with nodes CHOL 0, TRSM 1, TRSM 2]
53
SuperMatrix: Cholesky Factorization – Iteration 1. CHOL 0: A_{0,0} := Chol( A_{0,0} ); TRSM 1: A_{1,0} := A_{1,0} A_{0,0}^{-T}; TRSM 2: A_{2,0} := A_{2,0} A_{0,0}^{-T}; SYRK 3: A_{1,1} := A_{1,1} - A_{1,0} A_{1,0}^T; GEMM 4: A_{2,1} := A_{2,1} - A_{2,0} A_{1,0}^T; SYRK 5: A_{2,2} := A_{2,2} - A_{2,0} A_{2,0}^T. [Figure: DAG with nodes CHOL 0 through SYRK 5]
54
SuperMatrix: Cholesky Factorization – Iteration 2. CHOL 6: A_{1,1} := Chol( A_{1,1} ); TRSM 7: A_{2,1} := A_{2,1} A_{1,1}^{-T}; SYRK 8: A_{2,2} := A_{2,2} - A_{2,1} A_{2,1}^T. [Figure: DAG with nodes CHOL 0 through SYRK 8]
55
SuperMatrix: Cholesky Factorization – Iteration 3. CHOL 9: A_{2,2} := Chol( A_{2,2} ). [Figure: complete DAG, CHOL 0 through CHOL 9]
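The three iterations above come from the usual right-looking blocked Cholesky loop. The sketch below enumerates the tasks for an nb × nb grid of blocks and, for nb = 3, reproduces CHOL 0 through CHOL 9 in the order shown; enqueue_task is a hypothetical placeholder standing in for SuperMatrix recording an operation and its block operands in the DAG instead of executing it.

  /* Hypothetical hook: record one task and its block indices in the DAG. */
  void enqueue_task(const char *op, int i, int j, int k);

  /* Enumerate the tasks of a right-looking blocked Cholesky factorization
     (lower triangular) of an nb x nb grid of blocks A[i][j].              */
  void blocked_cholesky_tasks(int nb)
  {
      for (int k = 0; k < nb; k++) {
          enqueue_task("CHOL", k, k, k);              /* A[k][k] := Chol(A[k][k])               */

          for (int i = k + 1; i < nb; i++)
              enqueue_task("TRSM", i, k, k);          /* A[i][k] := A[i][k] * A[k][k]^{-T}      */

          for (int j = k + 1; j < nb; j++)
              for (int i = j; i < nb; i++) {
                  if (i == j)
                      enqueue_task("SYRK", i, i, k);  /* A[i][i] := A[i][i] - A[i][k] A[i][k]^T */
                  else
                      enqueue_task("GEMM", i, j, k);  /* A[i][j] := A[i][j] - A[i][k] A[j][k]^T */
              }
      }
  }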
56
Outline: How to Program SCC? – Elemental – Collective Communication – Off-Chip Shared-Memory – Conclusion
57
Conclusion: Distributed vs. shared-memory – Elemental vs. SuperMatrix? A collective communication library for SCC – RCCE_comm: released under the LGPL and available in the public Intel SCC software repository: http://marcbug.scc-dc.com/svn/repository/trunk/rcce_applications/UT/RCCE_comm/
58
Acknowledgments: We thank the other members of the FLAME team for their support – Bryan Marker, Jack Poulson, and Robert van de Geijn. We thank Intel for access to SCC and for their help – Timothy G. Mattson and Rob F. Van Der Wijngaart. Funding – Intel Corporation and the National Science Foundation.
59
Conclusion: More Information – http://www.cs.utexas.edu/~flame Questions? echan@cs.utexas.edu