THE UNIVERSITY OF TEXAS AT AUSTIN
Programming Dense Matrix Computations Using Distributed and Off-Chip Shared-Memory on Many-Core Architectures
Ernie Chan
Vienna talk, November 3, 2010

How to Program SCC?
– 48 cores in a 6×4 mesh of tiles with 2 cores per tile
– 4 DDR3 memory controllers
[Figure: SCC chip layout – a router (R) on every tile, four memory controllers at the edges, and a system interface; each tile holds Core 0 and Core 1, their L2 caches (L2$0, L2$1), a router, and the message passing buffer (MPB).]

Outline: How to Program SCC? – Elemental – Collective Communication – Off-Chip Shared-Memory – Conclusion

Elemental
New, modern distributed-memory dense linear algebra library
– Replacement for PLAPACK and ScaLAPACK
– Object-oriented data structures for matrices
– Coded in C++
– Torus-wrap/elemental mapping of matrices to a two-dimensional process grid
– Implemented entirely using bulk synchronous communication
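The torus-wrap/elemental mapping can be made concrete with a small sketch (illustration only, not Elemental's actual C++ interface): entry (i, j) of the matrix is owned by the process at position (i mod r, j mod c) of an r × c process grid and is stored locally at (i / r, j / c).

/* Illustration of the elemental (torus-wrap) distribution over an r x c
   process grid; Elemental's real C++ classes hide this arithmetic. */
typedef struct { int row; int col; } GridPos;

GridPos owning_process(int i, int j, int r, int c) {
    GridPos p = { i % r, j % c };   /* entry (i,j) lives on process (i mod r, j mod c) */
    return p;
}

void local_indices(int i, int j, int r, int c, int *li, int *lj) {
    *li = i / r;                    /* row index within the owner's local storage    */
    *lj = j / c;                    /* column index within the owner's local storage */
}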

Elemental
[Three figure-only slides; no transcript text.]

Elemental
Redistributing the Matrix Over a Process Grid
– Collective communication

Outline: How to Program SCC? – Elemental – Collective Communication – Off-Chip Shared-Memory – Conclusion

Collective Communication
RCCE Message Passing API
– Blocking send and receive
  int RCCE_send( char *buf, size_t num, int dest );
  int RCCE_recv( char *buf, size_t num, int src );
– Potential for deadlock
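A minimal sketch of the deadlock hazard (assuming the RCCE_send/RCCE_recv signatures above and RCCE_ue() from the RCCE API for the calling core's rank): because both calls block, two cores that each send first never reach their receives, while ordering the calls by rank breaks the cycle.

#include <stddef.h>
#include "RCCE.h"

/* Sketch only: a pairwise exchange that can deadlock with blocking calls.
   If both cores issue RCCE_send first, each blocks waiting for a matching
   RCCE_recv that is never posted. */
void naive_exchange(char *sendbuf, char *recvbuf, size_t num, int partner) {
    RCCE_send(sendbuf, num, partner);   /* both cores block here */
    RCCE_recv(recvbuf, num, partner);   /* never reached         */
}

/* The lower-ranked core sends first; the higher-ranked core receives first. */
void safe_exchange(char *sendbuf, char *recvbuf, size_t num, int partner) {
    if (RCCE_ue() < partner) {
        RCCE_send(sendbuf, num, partner);
        RCCE_recv(recvbuf, num, partner);
    } else {
        RCCE_recv(recvbuf, num, partner);
        RCCE_send(sendbuf, num, partner);
    }
}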

Collective Communication
Avoiding Deadlock
– Even number of cores in cycle [figure]
– Odd number of cores in cycle [figure]
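One standard way to order the blocking calls around a cycle is by the parity of each core's position (a generic sketch of the technique, not necessarily the exact RCCE_comm code): even-positioned cores send first and odd-positioned cores receive first, so with an even number of cores all exchanges pair up, while with an odd number the single mismatched pair serializes instead of deadlocking.

#include <stddef.h>
#include "RCCE.h"

/* Sketch: every core passes a block to its right neighbor and receives one
   from its left neighbor around a cycle of p cores.  Parity ordering keeps
   the blocking calls from forming a waiting cycle; when p is odd, cores p-1
   and 0 are both even-positioned and their exchange costs one extra step. */
void cycle_shift(char *sendbuf, char *recvbuf, size_t num, int me, int p) {
    if (p < 2) return;                  /* nothing to exchange */
    int right = (me + 1) % p;
    int left  = (me - 1 + p) % p;

    if (me % 2 == 0) {
        RCCE_send(sendbuf, num, right);
        RCCE_recv(recvbuf, num, left);
    } else {
        RCCE_recv(recvbuf, num, left);
        RCCE_send(sendbuf, num, right);
    }
}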

Collective Communication
Broadcast
int RCCE_bcast( char *buf, size_t num, int root, RCCE_COMM comm );
[Before/after figures]
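A hedged usage sketch of the broadcast (the calling pattern is the same for the other collectives below): buffers are passed as char* with sizes in bytes, and every core in the communicator makes the call. RCCE_init, RCCE_finalize, RCCE_ue, the RCCE_APP entry point, and the global communicator RCCE_COMM_WORLD are assumed from the RCCE API.

#include <stdlib.h>
#include "RCCE.h"

/* Usage sketch; RCCE_APP, RCCE_init, RCCE_ue, RCCE_COMM_WORLD assumed from RCCE. */
int RCCE_APP(int argc, char **argv) {
    RCCE_init(&argc, &argv);

    int root = 0;
    size_t n = 1024;
    double *x = (double *) malloc(n * sizeof(double));

    if (RCCE_ue() == root)                      /* only the root fills the buffer */
        for (size_t i = 0; i < n; i++) x[i] = (double) i;

    /* Collective call: afterwards every core holds the root's copy of x. */
    RCCE_bcast((char *) x, n * sizeof(double), root, RCCE_COMM_WORLD);

    free(x);
    RCCE_finalize();
    return 0;
}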

Collective Communication
Reduce
int RCCE_reduce( char *inbuf, char *outbuf, int num, int type, int op, int root, RCCE_COMM comm );
[Before/after figures]

Collective Communication
Gather
int RCCE_gather( char *inbuf, char *outbuf, size_t num, int root, RCCE_COMM comm );
[Before/after figures]

Collective Communication
Scatter
int RCCE_scatter( char *inbuf, char *outbuf, size_t num, int root, RCCE_COMM comm );
[Before/after figures]

Collective Communication
Allgather
int RCCE_allgather( char *inbuf, char *outbuf, size_t num, RCCE_COMM comm );
[Before/after figures]

Collective Communication
Reduce-Scatter
int RCCE_reduce_scatter( char *inbuf, char *outbuf, int *counts, int type, int op, RCCE_COMM comm );
[Before/after figures]

Collective Communication
Allreduce
int RCCE_allreduce( char *inbuf, char *outbuf, int num, int type, int op, RCCE_COMM comm );
[Before/after figures]

Collective Communication
Alltoall
int RCCE_alltoall( char *inbuf, char *outbuf, size_t num, RCCE_COMM comm );
[Before/after figures]

Collective Communication
SendRecv
int RCCE_sendrecv( char *inbuf, size_t innum, int dest, char *outbuf, size_t outnum, int src, RCCE_COMM comm );
– A send call and a receive call combined into a single operation
– Passing -1 as the rank for dest or src causes the corresponding send or receive to be skipped
– Implemented as a permutation
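A hedged usage sketch of RCCE_sendrecv for a non-cyclic shift, relying on the -1 convention described above (RCCE_ue, RCCE_num_ues, and RCCE_COMM_WORLD are assumed from the RCCE API):

#include <stddef.h>
#include "RCCE.h"

/* Sketch: shift a block one core to the right in rank order.  The last core
   has nobody to send to and the first core has nobody to receive from, so
   they pass -1 and that half of the operation is skipped. */
void shift_right(char *sendbuf, char *recvbuf, size_t num) {
    int me   = RCCE_ue();
    int p    = RCCE_num_ues();
    int dest = (me == p - 1) ? -1 : me + 1;
    int src  = (me == 0)     ? -1 : me - 1;

    RCCE_sendrecv(sendbuf, num, dest, recvbuf, num, src, RCCE_COMM_WORLD);
}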

Collective Communication
Minimum Spanning Tree Algorithm – Scatter
[Four figure slides stepping through the algorithm: at each step the current root sends the half of the data destined for the other half of the cores to a new root in that half, and the algorithm then recurses independently on each half.]
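A hedged C sketch of such a minimum spanning tree scatter built on the blocking RCCE_send/RCCE_recv calls (a generic recursive-halving formulation, not necessarily the RCCE_comm implementation): every core in ranks lo..hi calls the routine, the root's buffer initially holds one num-byte block per core ordered by rank, and on return each core's own block sits at the front of its buffer.

#include <stddef.h>
#include <string.h>
#include "RCCE.h"

/* Sketch: scatter num bytes to each core in ranks [lo, hi].  Each core's buf
   must be large enough for the blocks it may relay (p * num bytes always
   suffices).  me is the calling core's rank. */
void mst_scatter(char *buf, size_t num, int root, int lo, int hi, int me) {
    if (lo == hi) return;                            /* one core left: its block is at buf[0] */

    int mid      = (lo + hi) / 2;                    /* split into [lo,mid] and [mid+1,hi]    */
    int new_root = (root <= mid) ? mid + 1 : lo;     /* root of the half not containing root  */
    size_t lower_bytes = (size_t)(mid - lo + 1) * num;
    size_t upper_bytes = (size_t)(hi - mid) * num;

    if (root <= mid) {                               /* root forwards the upper half's blocks */
        if (me == root)
            RCCE_send(buf + lower_bytes, upper_bytes, new_root);
        else if (me == new_root)
            RCCE_recv(buf, upper_bytes, root);
    } else {                                         /* root forwards the lower half's blocks */
        if (me == root) {
            RCCE_send(buf, lower_bytes, new_root);
            memmove(buf, buf + lower_bytes, upper_bytes);   /* keep own half at offset 0 */
        } else if (me == new_root) {
            RCCE_recv(buf, lower_bytes, root);
        }
    }

    if (me <= mid)                                   /* recurse within our own half only */
        mst_scatter(buf, num, (root <= mid) ? root : new_root, lo, mid, me);
    else
        mst_scatter(buf, num, (root <= mid) ? new_root : root, mid + 1, hi, me);
}

For p cores the critical path is ceil(log2 p) send/receive rounds, which is what makes the minimum spanning tree algorithm attractive for operations rooted at a single core.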

Collective Communication
Cyclic (Bucket) Algorithm – Allgather
[Six figure slides stepping through the algorithm: the cores form a ring, and in each of p-1 steps every core forwards the block it most recently received (starting with its own contribution) to its neighbor, so that after p-1 steps every core holds all p blocks.]
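A hedged C sketch of the bucket allgather on top of the blocking RCCE calls, using the parity ordering shown earlier to keep the ring deadlock-free (a generic formulation, not necessarily the RCCE_comm code): outbuf has room for p blocks of num bytes each, and block r ends up at offset r*num on every core.

#include <stddef.h>
#include <string.h>
#include "RCCE.h"

/* Sketch of the cyclic (bucket) allgather.  inbuf is this core's num-byte
   contribution; outbuf holds all p blocks in rank order on return. */
void bucket_allgather(char *inbuf, char *outbuf, size_t num, int me, int p) {
    int right = (me + 1) % p;
    int left  = (me - 1 + p) % p;

    memcpy(outbuf + (size_t)me * num, inbuf, num);   /* place our own contribution */

    for (int step = 0; step < p - 1; step++) {
        int send_idx = (me - step + p) % p;          /* block we forward this step */
        int recv_idx = (me - step - 1 + p) % p;      /* block arriving this step   */
        char *sendblk = outbuf + (size_t)send_idx * num;
        char *recvblk = outbuf + (size_t)recv_idx * num;

        if (me % 2 == 0) {                           /* parity ordering avoids deadlock */
            RCCE_send(sendblk, num, right);
            RCCE_recv(recvblk, num, left);
        } else {
            RCCE_recv(recvblk, num, left);
            RCCE_send(sendblk, num, right);
        }
    }
}

Each core sends and receives p-1 blocks of num bytes in total, so the algorithm is bandwidth-efficient at the cost of p-1 steps.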

Collective Communication
[Figure slide; no transcript text.]

Elemental
[Five figure-only slides; no transcript text.]

Outline: How to Program SCC? – Elemental – Collective Communication – Off-Chip Shared-Memory – Conclusion

Off-Chip Shared-Memory
Distributed vs. Shared-Memory
[Figure: the SCC chip layout again, labeled to contrast the distributed-memory and shared-memory views of the off-chip DRAM reached through the four memory controllers.]

Off-Chip Shared-Memory
SuperMatrix
– Map dense matrix computation to a directed acyclic graph
– No matrix distribution
– Store DAG and matrix on off-chip shared-memory
[DAG figure for a 3×3 blocked Cholesky factorization: CHOL 0 → TRSM 1, TRSM 2 → SYRK 3, GEMM 4, SYRK 5 → CHOL 6 → TRSM 7 → SYRK 8 → CHOL 9]
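A hedged sketch of the kind of task record such a runtime might keep in off-chip shared memory (illustrative names, not SuperMatrix's actual data structures): each DAG node records which block operation to run, which blocks it touches, and how many predecessors are still unfinished.

#include <stddef.h>

/* Illustrative only: a task node for a runtime that schedules block
   operations (CHOL, TRSM, SYRK, GEMM) as a DAG stored in shared memory. */
typedef enum { TASK_CHOL, TASK_TRSM, TASK_SYRK, TASK_GEMM } TaskKind;

typedef struct Task {
    TaskKind     kind;        /* block operation to execute                 */
    double      *blocks[3];   /* matrix blocks the task reads and/or writes */
    int          deps_left;   /* number of unfinished predecessor tasks     */
    struct Task *succ[8];     /* successors to notify on completion         */
    int          nsucc;
} Task;

/* When a task finishes, each successor loses one outstanding dependence;
   a successor that reaches zero becomes ready to run.  (Synchronization of
   the counter across cores is omitted in this sketch.) */
void task_completed(Task *t, void (*enqueue_ready)(Task *)) {
    for (int i = 0; i < t->nsucc; i++)
        if (--t->succ[i]->deps_left == 0)
            enqueue_ready(t->succ[i]);
}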

Off-Chip Shared-Memory
Non-cacheable vs. Cacheable Shared-Memory
– Non-cacheable
  - Allow for a simple programming interface
  - Poor performance
– Cacheable
  - Need software managed cache coherency mechanism
  - Execute on data stored in cache
  - Interleave distributed and shared-memory programming concepts
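A heavily hedged sketch of the cacheable approach's core obligation: with no hardware coherence for off-chip shared memory on SCC, software must invalidate stale cached copies before reading shared data and flush dirty lines after writing it. The allocator RCCE_shmalloc is assumed from RCCE's shared-memory interface, and cache_flush_range is a hypothetical helper standing in for whatever flush/invalidate mechanism the platform provides.

#include <stddef.h>
#include "RCCE.h"

/* Hypothetical helper, not a real RCCE call: write back and evict the cache
   lines covering [p, p + bytes). */
void cache_flush_range(void *p, size_t bytes);

void use_shared_block(size_t n) {
    /* One matrix block placed in off-chip shared memory, visible to all
       cores (RCCE_shmalloc assumed). */
    double *A = (double *) RCCE_shmalloc(n * n * sizeof(double));

    cache_flush_range(A, n * n * sizeof(double));  /* drop stale cached copies before reading */
    /* ... read A and compute on it out of the cache ... */
    /* ... write updated values back into A ...           */
    cache_flush_range(A, n * n * sizeof(double));  /* make the writes visible to other cores  */
}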

Off-Chip Shared-Memory
[Figure slide; no transcript text.]

SuperMatrix
Cholesky Factorization of a 3×3 blocked matrix – tasks generated per iteration:
Iteration 1
  CHOL 0: A_{0,0} := Chol( A_{0,0} )
  TRSM 1: A_{1,0} := A_{1,0} A_{0,0}^{-T}
  TRSM 2: A_{2,0} := A_{2,0} A_{0,0}^{-T}
  SYRK 3: A_{1,1} := A_{1,1} - A_{1,0} A_{1,0}^{T}
  GEMM 4: A_{2,1} := A_{2,1} - A_{2,0} A_{1,0}^{T}
  SYRK 5: A_{2,2} := A_{2,2} - A_{2,0} A_{2,0}^{T}
Iteration 2
  CHOL 6: A_{1,1} := Chol( A_{1,1} )
  TRSM 7: A_{2,1} := A_{2,1} A_{1,1}^{-T}
  SYRK 8: A_{2,2} := A_{2,2} - A_{2,1} A_{2,1}^{T}
Iteration 3
  CHOL 9: A_{2,2} := Chol( A_{2,2} )
[The accompanying figures grow the DAG slide by slide as these tasks are added.]
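To connect the task names to code, here is a hedged sketch of the blocked right-looking Cholesky loop that generates exactly this sequence of block operations; in SuperMatrix each kernel call becomes a node in the DAG rather than executing immediately. The four block kernels are hypothetical stand-ins, not SuperMatrix's actual API.

/* Hypothetical block kernels standing in for the DAG's node types. */
void chol_block(double *Akk);                        /* Akk := Chol(Akk)         */
void trsm_block(const double *Akk, double *Aik);     /* Aik := Aik Akk^{-T}      */
void syrk_block(const double *Ajk, double *Ajj);     /* Ajj := Ajj - Ajk Ajk^{T} */
void gemm_block(const double *Aik, const double *Ajk, double *Aij);  /* Aij := Aij - Aik Ajk^{T} */

/* A holds pointers to the nb x nb blocks, row-major: block (i,j) is A[i*nb + j]. */
void blocked_cholesky(double *A[], int nb) {
    for (int k = 0; k < nb; k++) {
        chol_block(A[k*nb + k]);                                    /* CHOL */
        for (int i = k + 1; i < nb; i++)
            trsm_block(A[k*nb + k], A[i*nb + k]);                   /* TRSM */
        for (int j = k + 1; j < nb; j++) {
            syrk_block(A[j*nb + k], A[j*nb + j]);                   /* SYRK */
            for (int i = j + 1; i < nb; i++)
                gemm_block(A[i*nb + k], A[j*nb + k], A[i*nb + j]);  /* GEMM */
        }
    }
}

With nb = 3 the loop produces CHOL 0, TRSM 1, TRSM 2, SYRK 3, GEMM 4, SYRK 5, CHOL 6, TRSM 7, SYRK 8, CHOL 9, matching the numbering above.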

Outline: How to Program SCC? – Elemental – Collective Communication – Off-Chip Shared-Memory – Conclusion

Conclusion
Distributed vs. Shared-Memory
– Elemental vs. SuperMatrix?
A Collective Communication Library for SCC
– RCCE_comm: released under LGPL and available in the public Intel SCC software repository under rcce_applications/UT/RCCE_comm/

Acknowledgments
We thank the other members of the FLAME team for their support
– Bryan Marker, Jack Poulson, and Robert van de Geijn
We thank Intel for access to SCC and their help
– Timothy G. Mattson and Rob F. Van Der Wijngaart
Funding
– Intel Corporation
– National Science Foundation

Conclusion
More Information
Questions?