The Suzaku Pattern Programming Framework
6th NSF/TCPP Workshop on Parallel and Distributed Computing Education (EduPar-16), 2016 IEEE International Parallel and Distributed Processing Symposium.

Presentation transcript:

The Suzaku Pattern Programming Framework
6th NSF/TCPP Workshop on Parallel and Distributed Computing Education (EduPar-16)
2016 IEEE International Parallel and Distributed Processing Symposium, Monday May 23rd, 2016
Dr. Barry Wilkinson, University of North Carolina Charlotte
Dr. Clayton Ferner, University of North Carolina Wilmington
© 2016 B. Wilkinson/Clayton Ferner. Modification date May 12, 2016

2 Suzaku Pattern Programming Framework
Developed to create “pattern-based” parallel MPI programs without writing the MPI message-passing code implicit in the patterns. Typeless. The focus is on teaching parallel programming.
Some advantages:
Better-structured programs based upon established parallel design patterns.
Less error-prone: avoids the complexities of MPI routines and simplifies programming.
Scalable and maintainable designs.

3 Parallel Design Patterns
Low-level message-passing patterns: point-to-point, broadcast, scatter, gather, allgather, master-slave.
Higher-level computational patterns: static workpool (fixed task queue), dynamic workpool (variable task queue), pipeline, stencil, all-to-all, generalized pattern (based upon a directed graph), ...

4 Suzaku Program Structure
int main(int argc, char **argv) {
   int p, ...            // variable declarations and initializations
   SZ_Init(p);           // initialize environment, sets p to number of processes
   ...
   SZ_Parallel_begin     // parallel section
   ...
   SZ_Parallel_end;
   ...
   SZ_Finalize();
   return 0;
}
All variables declared at the top are duplicated in each process, and all initializations there apply to all copies of the variables. After the call to SZ_Init(), only the master process executes code, until a parallel section. After SZ_Parallel_begin, all processes execute the code, until SZ_Parallel_end; after that, only the master process executes code. Similar to OpenMP, but using processes instead of threads.

5 Point-to-point Pattern
int main(int argc, char *argv[]) {
   char m[20], n[20];
   int p, x, y, xx[5], yy[5];
   double a, b, aa[10], bb[10], aaa[2][3], bbb[2][3];
   ...
   SZ_Init(p);                        // initialize environment
   SZ_Parallel_begin                  // parallel section - from proc 0 to proc 1:
   SZ_Point_to_point(0, 1, m, n);     // send a string
   SZ_Point_to_point(0, 1, &x, &y);   // send an int
   SZ_Point_to_point(0, 1, &a, &b);   // send a double
   SZ_Point_to_point(0, 1, xx, yy);   // send a 1-D array of ints
   SZ_Point_to_point(0, 1, aa, bb);   // send a 1-D array of doubles
   SZ_Point_to_point(0, 1, aaa, bbb); // send a 2-D array of doubles
   SZ_Parallel_end;                   // end of parallel section
   ...
   SZ_Finalize();
   return 0;
}
Data size does not have to be specified. Data type does not have to be specified: char, int, double, 1-D arrays of chars, ints, and doubles, and multi-dimensional arrays of doubles are allowed.

6 Matrix Multiplication using Master-slave Pattern with Low-level Patterns
#define N 256
int main(int argc, char *argv[]) {
   int i, j, k, p, blksz;
   double A[N][N], B[N][N], C[N][N], sum;
   SZ_Init(p);
   ...
   SZ_Parallel_begin
   blksz = N/p;                  // assumes N is a multiple of p
   double A1[blksz][N], C1[blksz][N];
   SZ_Scatter(A, A1);            // scatter blksz rows of A array
   SZ_Broadcast(B);              // broadcast B array
   for (i = 0; i < blksz; i++)   // matrix multiplication (slaves)
      for (j = 0; j < N; j++) {
         sum = 0;
         for (k = 0; k < N; k++)
            sum += A1[i][k] * B[k][j];
         C1[i][j] = sum;
      }
   SZ_Gather(C1, C);             // gather results, blksz rows of C1 (master)
   SZ_Parallel_end;
   ...
   SZ_Finalize();
   return 0;
}
Data size is determined by the size of the destination. Arrays can be static or variable-length, but not dynamic.

7 Higher-Level Patterns - Workpool Pattern
A very widely applicable pattern. The master gives each slave a task from the task queue. Once a slave returns the result for a task, it is given another task from the task queue, until the queue is empty. This provides a load-balancing quality. (Figure: master holds the task queue; slaves/workers return results, and the answers are aggregated.)

8 Workpool Interface
init() initializes the number of tasks and the size of tasks and results. Three phases:
1. Master sends data to slaves (diffuse)
2. Slaves perform computations (compute)
3. Master gathers results from slaves (gather)
Message passing is done by the framework. The programmer implements four routines: init(), diffuse(), compute(), and gather().

9 Workpool Version 1 Program Structure
void init(int *T, int *D, int *R) { ... }
void diffuse(int *taskID, double output[D]) { ... }
void compute(int taskID, double input[D], double output[R]) { ... }
void gather(int taskID, double input[R]) { ... }

int main(int argc, char *argv[]) {
   int P;
   ...
   SZ_Init(P);
   ...
   SZ_Parallel_begin
   SZ_Workpool(init, diffuse, compute, gather);   // provided by framework
   SZ_Parallel_end;
   ...                                            // final results
   SZ_Finalize();
   return 0;
}
In version 1, tasks and results are limited to 1-D arrays of doubles. Function parameters can be renamed to accommodate multiple workpools.

10 Workpool Version 2
Task data and results can be multiple items of different sizes and types. SZ_Put() packs data into tasks and results; SZ_Get() retrieves the data. Size does not need to be specified. Type does not need to be specified; allowed types are char, int, double, 1-D arrays of chars, ints, and doubles, and multi-dimensional arrays of doubles. Modeled on our earlier Java-based Seeds pattern programming framework, but using the MPI pack/unpack mechanism internally.

11 Program using Suzaku Workpool Version 2
void init(int *tasks) {      // sets number of tasks
   *tasks = 4;
}
void diffuse(int taskID) {
   char w[] = "Hello World";
   int x;
   double y = 5678, z[2][3];
   ...
   SZ_Put("w", w); SZ_Put("x", &x); SZ_Put("y", &y); SZ_Put("z", z);
}
void compute(int taskID) {
   char w[12] = " ";
   int x = 0;
   double y = 0, z[2][3];
   SZ_Get("z", z); SZ_Get("x", &x); SZ_Get("w", w); SZ_Get("y", &y);
   ...                       // compute
   SZ_Put("xx", &x); SZ_Put("yy", &y); SZ_Put("zz", z); SZ_Put("ww", w);
}
void gather(int taskID) {
   char w[12] = " ";
   int x = 0;
   double y = 0, z[2][3];
   ...
   SZ_Get("ww", w); SZ_Get("zz", z); SZ_Get("xx", &x); SZ_Get("yy", &y);
   return;
}
int main(int argc, char *argv[]) {
   int p;
   SZ_Init(p);
   ...
   SZ_Parallel_begin
   SZ_Workpool2(init, diffuse, compute, gather);
   SZ_Parallel_end;
   ...
   SZ_Finalize();
   return 0;
}
Data input and output parameters are not needed.

12 Workpool Version 3 “Dynamic Workpool”
New tasks can be added to the task queue during the computation, as might be needed for problems such as the shortest-path problem. SZ_Insert_task() is provided to add tasks to the task queue. SZ_Put() and SZ_Get() are available from version 2 to add data to tasks and results.

13 Shortest Path Problem using Dynamic Workpool
#define N 6                    // number of nodes
int w[N][N], dist[N], newdist_j;

void init(int *T) {
   ...                         // initialize dist[], w[][]
   SZ_Master {
      SZ_Insert_task(0);       // insert first node
   }
}
void diffuse(int taskID) {     // put current distances
   SZ_Put("dist", dist);       // from array dist[]
}
void compute(int taskID) {
   int i, j, new_tasks[N];
   SZ_Get("dist", dist);       // update dist[]
   for (i = 0; i < N; i++) new_tasks[i] = 0;
   i = 0;
   for (j = 0; j < N; j++) {   // Moore's algorithm
      if (w[taskID][j] != -1) {
         newdist_j = dist[taskID] + w[taskID][j];
         if (newdist_j < dist[j]) {
            dist[j] = newdist_j;
            if (j < N-1) {     // do not add last vertex
               new_tasks[i] = j;
               i++;
            }
         }
      }
   }
   SZ_Put("result", new_tasks);
   SZ_Put("dist", dist);
}
void gather(int taskID) {
   int i, dist_recv[N], new_tasks[N];
   SZ_Get("result", new_tasks);   // get new tasks
   SZ_Get("dist", dist_recv);
   for (i = 0; i < N; i++)
      if (dist_recv[i] < dist[i]) dist[i] = dist_recv[i];
   for (i = 0; i < N; i++)
      if (new_tasks[i] != 0) SZ_Insert_task(new_tasks[i]);
}
int main(int argc, char *argv[]) {
   int p;
   SZ_Init(p);
   ...
   SZ_Parallel_begin
   SZ_Workpool3(init, diffuse, compute, gather);
   SZ_Parallel_end;
   ...                         // print final results in dist[]
   SZ_Finalize();
   return 0;
}

14 Iterative Synchronous Patterns
When a pattern is repeated until some termination condition occurs. Synchronization at the end of each iteration establishes the termination condition, often a global condition. Note: two patterns are merged together sequentially, if we call iteration a pattern. (Figure: pattern → check termination condition → repeat or stop.)

15 Sorting using Suzaku Pipeline
Same interface with init(), diffuse(), compute(), gather().
#define N 1                    // size of data being sent
#define P 4                    // number of procs and numbers

void init(int *T, int *D) {
   *T = 4;                     // number of tasks
   *D = 1;                     // number of doubles in each task
   srand(999);
}
void diffuse(int taskID, double output[N]) {
   if (taskID < P)
      output[0] = rand() % 100;
   else
      output[0] = 999;         // otherwise terminator
}
void compute(int taskID, double input[N], double output[N]) {
   static double largest = 0;
   if (input[0] > largest) {
      output[0] = largest;
      largest = input[0];
   } else {
      output[0] = input[0];
   }
}
void gather(int taskID, double input[N]) {
   if (input[0] == 999) SZ_terminate();
}
int main(int argc, char *argv[]) {
   int p;
   SZ_Init(p);
   ...
   SZ_Parallel_begin
   SZ_Debug();
   SZ_Pipeline(init, diffuse, compute, gather);
   SZ_Parallel_end;
   ...
   SZ_Finalize();
   return 0;
}
(Figure: master feeds values through a pipeline of slaves; check termination condition, repeat or stop.)

16 Suzaku Generalized Pattern Concept
Rather than implement every pattern in a unique way, a “connection” graph specifies the pattern and the location in the destination array for messages. Then a “generalized send” routine sends data to all connected processes. Of course, one has to avoid messaging deadlock in the Suzaku implementation. (Figure: the connection graph defines the pattern and destination locations; broadcast, scatter, and gather connect master and slaves; repeat until the termination condition, then stop.)

17 SZ_Generalized_send(output, input): data in output[N] in each source process is delivered into input[P][N] in the destination processes. Entry connection_graph[source][destination] is -1 for no connection, or a value x meaning a connection with destination row x, i.e. input[x][N].

18 Example using Generalized Pattern: Iterative Synchronous Stencil Pattern
All nodes can communicate, but only with neighboring nodes; the computation repeats until a termination condition is met, then stops. Applications: solving Laplace's/heat equation - perform a number of iterations to converge on the solution.

19 Program Segment for 2-dimensional Heat Distribution Problem
SZ_Parallel_begin                 // parallel section, all processes do this
SZ_Pattern_init("stencil", 1);    // set up slave interconnections
SZ_Broadcast(pts);                // initial values in each process
...                               // copy into B[][]
for (t = 0; t < T; t++) {         // compute values over time T
   A[0] = 0.25 * (B[0][0] + B[1][0] + B[2][0] + B[3][0]);   // computation
   SZ_Generalized_send(A, B);     // send computed results in A to B
}
SZ_Gather(A, C);                  // collect results into C
SZ_Parallel_end;                  // end of parallel section
This example is simplified: each slave handles just one point. Normally each slave would handle a block of points.

20 Classroom Experiences
Possible ways to use the pattern approach:
Bottom-up approach - OpenMP and MPI first, then patterns and Suzaku, or
Top-down approach - patterns and Suzaku first, OpenMP and MPI later (or not learned at all).
After trying both, we found the bottom-up approach best for a full senior undergraduate parallel programming course. For lower-level programming courses, the top-down approach has advantages: it instills good software engineering principles, and OpenMP and MPI do not need to be covered at all - they can be left for a later course.

21 UNC-Charlotte Pattern-based Parallel Programming Course, Spring 2015 and Fall 2015
Online senior undergraduate/graduate course. Spring 2015: 65 students; Fall 2015: 62 students, about half undergraduate. A VirtualBox VM with the software pre-installed for students to use on their own computers, with access to the departmental cluster for final testing. Seven 2-week programming assignments; the fifth assignment was on Suzaku (astronomical N-body problem), after the OpenMP and MPI assignments. Students were asked in their assignment reports to give an evaluation comparing and contrasting Suzaku with MPI, to describe their experiences and opinions, and to give any suggestions for improvement.

22 Student Feedback
Spring 2015: 28 responses, all but one highly positive. Fall 2015: 47 responses, all but three highly positive.
Some comments: “easier to use”, “user friendly”, “less time to write code”, “more concise”, “using Suzaku was a lot of fun”.
Disadvantages mentioned: the data type had to be a double (Spring 2015); more documentation and understanding of Suzaku needed. However, overwhelmingly, students appreciated the ease with which parallel programs could be constructed with patterns.

Acknowledgements
This workshop is funded by the National Science Foundation under grant "Collaborative Research: Teaching Multicore and Many-Core Programming at a Higher Level of Abstraction" # / ( ). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Questions?