THE UNIVERSITY OF TEXAS AT AUSTIN
Programming Dense Matrix Computations Using Distributed and Off-Chip Shared-Memory on Many-Core Architectures
Ernie Chan
How to Program SCC?
– 48 cores in a 6×4 mesh with 2 cores per tile
– 4 DDR3 memory controllers
[Figure: SCC layout. A mesh of tiles, each attached to a router (R), with memory controllers and a system interface (System I/F). Each tile holds two cores (Core 0, Core 1), their L2 caches (L2$0, L2$1), a router, and a message-passing buffer (MPB).]
November 3, 2010, Vienna talk
Outline
– How to Program SCC?
– Elemental
– Collective Communication
– Off-Chip Shared-Memory
– Conclusion
Elemental
New, Modern Distributed-Memory Dense Linear Algebra Library
– Replacement for PLAPACK and ScaLAPACK
– Object-oriented data structures for matrices
– Coded in C++
– Torus-wrap/elemental mapping of matrices to a two-dimensional process grid
– Implemented entirely using bulk synchronous communication
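The torus-wrap (elemental) mapping named above can be illustrated with a minimal sketch. It assumes an r × c process grid with zero alignments, so that matrix entry (i, j) is assigned to process (i mod r, j mod c). The names `GridCoord`, `owner_of`, and `local_rows` are hypothetical helpers for illustration and are not part of the Elemental API.

```c
#include <assert.h>

/* Hypothetical helper type: coordinates of a process in an r x c grid. */
typedef struct { int row; int col; } GridCoord;

/* Elemental (torus-wrap) mapping, assuming zero alignments:
   entry (i, j) is owned by process (i mod r, j mod c). */
GridCoord owner_of(int i, int j, int r, int c)
{
    assert(i >= 0 && j >= 0 && r > 0 && c > 0);
    GridCoord p = { i % r, j % c };
    return p;
}

/* Number of rows of an m-row matrix that land on grid row p,
   i.e. the count of indices i < m with i mod r == p. */
int local_rows(int m, int p, int r)
{
    assert(m >= 0 && r > 0 && p >= 0 && p < r);
    return (m > p) ? (m - p + r - 1) / r : 0;
}
```

Because consecutive rows and columns cycle through the grid, every process receives nearly the same number of entries regardless of the matrix shape.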
Elemental
Redistributing the Matrix Over a Process Grid
– Collective communication
Collective Communication
RCCE Message Passing API
– Blocking send and receive
int RCCE_send( char *buf, size_t num, int dest );
int RCCE_recv( char *buf, size_t num, int src );
– Potential for deadlock
Collective Communication
Avoiding Deadlock
– Even number of cores in cycle
– Odd number of cores in cycle
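The figures for these two cases are not recoverable, so the following is only one plausible ordering, not necessarily the one used in RCCE_comm. It assumes a ring in which core i sends to core (i+1) mod n using blocking calls. Even-numbered cores send in phase 0 and odd-numbered cores in phase 1; when n is odd, the wrap-around transfer from core n-1 to core 0 joins two even cores, so it is given a third phase. The function names are hypothetical.

```c
#include <assert.h>

/* Phase in which core i performs its blocking send to core (i+1) % n.
   Hypothetical scheme for illustration, not taken from RCCE_comm. */
int send_phase(int i, int n)
{
    assert(n >= 2 && i >= 0 && i < n);
    if (n % 2 == 1 && i == n - 1)
        return 2;          /* odd cycle: the wrap-around edge gets its own phase */
    return i % 2;          /* even cores send in phase 0, odd cores in phase 1 */
}

/* Phase in which core i posts its blocking receive from core (i-1+n) % n.
   It must equal the phase in which that neighbour sends. */
int recv_phase(int i, int n)
{
    assert(n >= 2 && i >= 0 && i < n);
    return send_phase((i - 1 + n) % n, n);
}

/* Returns 1 if no core is asked to send and receive in the same phase. */
int phases_are_disjoint(int n)
{
    for (int i = 0; i < n; i++)
        if (send_phase(i, n) == recv_phase(i, n))
            return 0;
    return 1;
}
```

Each core executes its send and its receive in increasing phase order, so every blocking send is matched by a receive posted in the same phase and no cyclic wait can form.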
Collective Communication
Broadcast
int RCCE_bcast( char *buf, size_t num, int root, RCCE_COMM comm );
[Figures: data held by each core before and after the broadcast.]
Collective Communication
Reduce
int RCCE_reduce( char *inbuf, char *outbuf, int num, int type, int op, int root, RCCE_COMM comm );
[Figures: data held by each core before and after the reduce.]
Collective Communication
Gather
int RCCE_gather( char *inbuf, char *outbuf, size_t num, int root, RCCE_COMM comm );
[Figures: data held by each core before and after the gather.]
Collective Communication
Scatter
int RCCE_scatter( char *inbuf, char *outbuf, size_t num, int root, RCCE_COMM comm );
[Figures: data held by each core before and after the scatter.]
Collective Communication
Allgather
int RCCE_allgather( char *inbuf, char *outbuf, size_t num, RCCE_COMM comm );
[Figures: data held by each core before and after the allgather.]
Collective Communication
Reduce-Scatter
int RCCE_reduce_scatter( char *inbuf, char *outbuf, int *counts, int type, int op, RCCE_COMM comm );
[Figures: data held by each core before and after the reduce-scatter.]
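Since the before/after figures are missing, the effect of a reduce-scatter can be sketched with a sequential simulation. It assumes the usual semantics: every core contributes a full vector, the vectors are combined element-wise, and core k keeps only the k-th segment of the result, whose length is counts[k]. The function `sim_reduce_scatter_sum` is a hypothetical stand-in that uses doubles and a sum; it does not call RCCE.

```c
#include <assert.h>

/* Simulated reduce-scatter with a sum reduction over doubles.
   in[k]  : core k's full input vector (length = sum of counts[0..p-1])
   out[k] : core k's result segment (length counts[k])
   This is a sequential model of the semantics, not an RCCE call. */
void sim_reduce_scatter_sum(int p, double *const in[], double *const out[],
                            const int counts[])
{
    assert(p > 0);
    int offset = 0;                       /* start of segment k in each input */
    for (int k = 0; k < p; k++) {
        for (int e = 0; e < counts[k]; e++) {
            double sum = 0.0;
            for (int src = 0; src < p; src++)
                sum += in[src][offset + e];   /* combine contributions */
            out[k][e] = sum;                  /* core k keeps segment k only */
        }
        offset += counts[k];
    }
}
```

With two cores holding {1, 2, 3} and {10, 20, 30} and counts {1, 2}, core 0 ends with {11} and core 1 with {22, 33}.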
Collective Communication
Allreduce
int RCCE_allreduce( char *inbuf, char *outbuf, int num, int type, int op, RCCE_COMM comm );
[Figures: data held by each core before and after the allreduce.]
Collective Communication
Alltoall
int RCCE_alltoall( char *inbuf, char *outbuf, size_t num, RCCE_COMM comm );
[Figures: data held by each core before and after the alltoall.]
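An all-to-all exchange amounts to a transpose of blocks across cores, which a short sequential simulation can make concrete. It assumes each core's input holds p blocks of num elements, with block d destined for core d. `sim_alltoall` is a hypothetical model using int elements and does not call RCCE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simulated all-to-all over p cores.
   in[src]  : p blocks of num ints; block dst is destined for core dst
   out[dst] : p blocks of num ints; block src arrived from core src
   This models the data movement only; it is not an RCCE call. */
void sim_alltoall(int p, int num, int *const in[], int *const out[])
{
    assert(p > 0 && num >= 0);
    for (int src = 0; src < p; src++)
        for (int dst = 0; dst < p; dst++)
            memcpy(out[dst] + (size_t)src * num,
                   in[src] + (size_t)dst * num,
                   (size_t)num * sizeof(int));
}
```

Viewing the blocks as a p × p matrix indexed by (core, block), the output is the transpose of the input.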
Collective Communication
SendRecv
int RCCE_sendrecv( char *inbuf, size_t innum, int dest, char *outbuf, size_t outnum, int src, RCCE_COMM comm );
– A send call and a receive call combined into a single operation
– Passing -1 as the rank for dest or src suppresses the corresponding communication
– Implemented as a permutation
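The permutation view of SendRecv can be sketched as a sequential simulation in which every core names one destination and one source, and -1 means "no partner". The sketch assumes equal message sizes on all cores for brevity. `sim_sendrecv` and `NO_PEER` are hypothetical names; the consistency check is an illustration of what "permutation" requires, not a description of how RCCE_comm validates its arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NO_PEER (-1)   /* rank value that suppresses a send or a receive */

/* Simulated sendrecv over p cores with num ints per message.
   dest[k] / src[k] are the partners named by core k, or NO_PEER.
   Returns 0 on success, -1 if the arguments do not describe a consistent
   (partial) permutation. Sequential model only, not an RCCE call. */
int sim_sendrecv(int p, int num, int *const inbuf[], int *const outbuf[],
                 const int dest[], const int src[])
{
    assert(p > 0 && num >= 0);
    for (int k = 0; k < p; k++) {
        if (dest[k] < NO_PEER || dest[k] >= p) return -1;
        if (src[k] < NO_PEER || src[k] >= p) return -1;
        /* every send must be matched by the receive posted at its target */
        if (dest[k] != NO_PEER && src[dest[k]] != k) return -1;
        /* every receive must be matched by a send from its source */
        if (src[k] != NO_PEER && dest[src[k]] != k) return -1;
    }
    for (int k = 0; k < p; k++)
        if (dest[k] != NO_PEER)
            memcpy(outbuf[dest[k]], inbuf[k], (size_t)num * sizeof(int));
    return 0;
}
```

A shift without wrap-around on three cores uses dest = {1, 2, -1} and src = {-1, 0, 1}: core 0 only sends, core 2 only receives.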
Collective Communication
Minimum Spanning Tree Algorithm
– Scatter
[Figure sequence: successive steps of the minimum spanning tree scatter.]
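The step-by-step figures are lost, so here is a sequential simulation of a minimum spanning tree scatter as it is commonly described: the range of cores is halved, the root forwards the blocks for the other half to a partner at that half's far end, and both halves recurse. The core count `P`, the global arrays, and the function names are hypothetical; on real hardware the two recursive calls proceed in parallel on disjoint sets of cores.

```c
#include <assert.h>

#define P 8              /* simulated core count (hypothetical; SCC has 48) */

int buf[P][P];           /* buf[k][b]: block b as held by core k, -1 if absent */
int messages;            /* number of point-to-point messages sent */

/* Scatter blocks left..right, all held by root, so that core k gets block k. */
void mst_scatter(int root, int left, int right)
{
    if (left == right) return;
    int mid = (left + right) / 2;
    /* the partner sits at the far end of the half that lacks the root */
    int dest = (root <= mid) ? right : left;
    int lo = (root <= mid) ? mid + 1 : left;
    int hi = (root <= mid) ? right : mid;
    for (int b = lo; b <= hi; b++)     /* one message carrying blocks lo..hi */
        buf[dest][b] = buf[root][b];
    messages++;
    if (root <= mid) {
        mst_scatter(root, left, mid);
        mst_scatter(dest, mid + 1, right);
    } else {
        mst_scatter(dest, left, mid);
        mst_scatter(root, mid + 1, right);
    }
}

/* Runs a scatter from root; returns 1 if every core k ends with block k. */
int run_scatter(int root)
{
    assert(root >= 0 && root < P);
    for (int k = 0; k < P; k++)
        for (int b = 0; b < P; b++)
            buf[k][b] = (k == root) ? 100 + b : -1;
    messages = 0;
    mst_scatter(root, 0, P - 1);
    for (int k = 0; k < P; k++)
        if (buf[k][k] != 100 + k) return 0;
    return 1;
}
```

The simulation sends P - 1 messages in total, while the depth of the recursion, and hence the number of parallel steps, grows only logarithmically with the number of cores.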
Collective Communication
Cyclic (Bucket) Algorithm
– Allgather
[Figure sequence: successive steps of the cyclic (bucket) allgather.]
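In place of the missing figure sequence, the bucket algorithm can be sketched as a sequential simulation of a ring: in every step each core passes the block it obtained most recently to its right-hand neighbour, so after n - 1 steps every core holds all n blocks. `NCORES`, the global array, and the function names are hypothetical and chosen only for this illustration.

```c
#include <assert.h>

#define NCORES 6                 /* simulated ring size (hypothetical) */

int blocks[NCORES][NCORES];      /* blocks[k][b]: block b on core k, -1 if absent */

/* Initially core k owns only block k. */
void init_blocks(void)
{
    for (int k = 0; k < NCORES; k++)
        for (int b = 0; b < NCORES; b++)
            blocks[k][b] = (b == k) ? 100 + b : -1;
}

/* Cyclic (bucket) allgather on a ring; returns the number of steps taken. */
int bucket_allgather(void)
{
    int steps = 0;
    for (int s = 0; s < NCORES - 1; s++) {
        int incoming[NCORES];
        /* all cores send at once: stage the messages, then deliver them */
        for (int k = 0; k < NCORES; k++) {
            int b = (k - s + NCORES) % NCORES;      /* block core k forwards */
            incoming[(k + 1) % NCORES] = blocks[k][b];
        }
        for (int k = 0; k < NCORES; k++) {
            int b = (k - 1 - s + 2 * NCORES) % NCORES;  /* block that arrives */
            blocks[k][b] = incoming[k];
        }
        steps++;
    }
    return steps;
}

/* Returns 1 if every core holds every block. */
int all_blocks_present(void)
{
    for (int k = 0; k < NCORES; k++)
        for (int b = 0; b < NCORES; b++)
            if (blocks[k][b] != 100 + b) return 0;
    return 1;
}
```

Each step moves only one block per core and only between neighbours, which is why this pattern suits long vectors even though it needs more steps than a tree.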
Off-Chip Shared-Memory
Distributed vs. Shared-Memory
[Figure: the SCC mesh of tiles and routers (R) with its memory controllers and system interface (System I/F), contrasting distributed memory with off-chip shared memory.]
Off-Chip Shared-Memory
SuperMatrix
– Map dense matrix computation to a directed acyclic graph
– No matrix distribution
– Store DAG and matrix on off-chip shared-memory
[Figure: DAG with tasks CHOL 0, TRSM 1, TRSM 2, SYRK 3, GEMM 4, SYRK 5, CHOL 6, TRSM 7, SYRK 8, CHOL 9.]
Off-Chip Shared-Memory
Non-cacheable vs. Cacheable Shared-Memory
– Non-cacheable
  – Allow for a simple programming interface
  – Poor performance
– Cacheable
  – Need software managed cache coherency mechanism
  – Execute on data stored in cache
  – Interleave distributed and shared-memory programming concepts
SuperMatrix
Cholesky Factorization – Iteration 1
– CHOL 0: Chol( $A_{0,0}$ )
– TRSM 1: $A_{1,0} A_{0,0}^{-T}$
– TRSM 2: $A_{2,0} A_{0,0}^{-T}$
– SYRK 3: $A_{1,1} - A_{1,0} A_{1,0}^{T}$
– GEMM 4: $A_{2,1} - A_{2,0} A_{1,0}^{T}$
– SYRK 5: $A_{2,2} - A_{2,0} A_{2,0}^{T}$
Cholesky Factorization – Iteration 2
– CHOL 6: Chol( $A_{1,1}$ )
– TRSM 7: $A_{2,1} A_{1,1}^{-T}$
– SYRK 8: $A_{2,2} - A_{2,1} A_{2,1}^{T}$
Cholesky Factorization – Iteration 3
– CHOL 9: Chol( $A_{2,2}$ )
[Figure: the DAG grows with each iteration until it contains all ten tasks, CHOL 0 through CHOL 9.]
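The task numbering above can be reproduced with a small sketch that walks a blocked right-looking Cholesky factorization and records, for every task, which earlier task last wrote each block it touches. This is a simplified illustration under stated assumptions: only flow dependencies through the last writer are tracked, and `NB`, `Task`, `enqueue`, and `build_cholesky_dag` are hypothetical names, not SuperMatrix's actual data structures.

```c
#include <assert.h>

#define NB 3                 /* blocks per dimension, as on the slides */
#define MAX_TASKS 32
#define MAX_EDGES 64

typedef enum { CHOL, TRSM, SYRK, GEMM } Kind;
typedef struct { Kind kind; int i, j; } Task;   /* task overwrites A(i,j) */

Task tasks[MAX_TASKS];
int  edges[MAX_EDGES][2];    /* edges[e] = { producer id, consumer id } */
int  n_tasks, n_edges;
int  last_writer[NB][NB];    /* id of the task that last wrote A(i,j), or -1 */

/* Add an edge from the last writer of A(i,j), if any, to the consumer. */
void depend_on(int i, int j, int consumer)
{
    int producer = last_writer[i][j];
    if (producer >= 0) {
        assert(n_edges < MAX_EDGES);
        edges[n_edges][0] = producer;
        edges[n_edges][1] = consumer;
        n_edges++;
    }
}

/* Enqueue a task that updates A(i,j) after reading up to two other blocks;
   pass -1 for an unused input block. */
void enqueue(Kind kind, int i, int j, int ri1, int rj1, int ri2, int rj2)
{
    assert(n_tasks < MAX_TASKS);
    int id = n_tasks++;
    tasks[id] = (Task){ kind, i, j };
    depend_on(i, j, id);
    if (ri1 >= 0) depend_on(ri1, rj1, id);
    if (ri2 >= 0) depend_on(ri2, rj2, id);
    last_writer[i][j] = id;
}

void build_cholesky_dag(void)
{
    n_tasks = n_edges = 0;
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            last_writer[i][j] = -1;
    for (int k = 0; k < NB; k++) {
        enqueue(CHOL, k, k, -1, -1, -1, -1);        /* Chol( A(k,k) ) */
        for (int i = k + 1; i < NB; i++)
            enqueue(TRSM, i, k, k, k, -1, -1);      /* A(i,k) A(k,k)^-T */
        for (int j = k + 1; j < NB; j++) {
            enqueue(SYRK, j, j, j, k, -1, -1);      /* A(j,j) - A(j,k) A(j,k)^T */
            for (int i = j + 1; i < NB; i++)
                enqueue(GEMM, i, j, i, k, j, k);    /* A(i,j) - A(i,k) A(j,k)^T */
        }
    }
}
```

For a 3 × 3 blocking this yields ten tasks in the same order as the slide labels, and a task becomes ready to run once all of its incoming edges are satisfied.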
Conclusion
Distributed vs. Shared-Memory
– Elemental vs. SuperMatrix?
A Collective Communication Library for SCC
– RCCE_comm: released under LGPL and available on the public Intel SCC software repository
  rcce_applications/UT/RCCE_comm/
Acknowledgments
We thank the other members of the FLAME team for their support
– Bryan Marker, Jack Poulson, and Robert van de Geijn
We thank Intel for access to SCC and their help
– Timothy G. Mattson and Rob F. Van Der Wijngaart
Funding
– Intel Corporation
– National Science Foundation
Conclusion
More Information
Questions?