Patterns Paraguin Compiler Version 2.1.

Patterns
As of right now, only two patterns are implemented in Paraguin:
- Scatter/Gather (also known as master/slave)
- Stencil

Scatter/Gather
1. The master prepares the input.
2. The input is scattered to all processors.
3. The processors work independently (no communication).
4. The partial results are gathered together to build the final result.
[Figure: Scatter, boxes numbered 1 to 7, Gather]

Scatter/Gather
This pattern is done as a template rather than a single pragma:
1. Master prepares input
2. Scatter input
3. Compute partial results
4. Gather partial results into the final result

Scatter/Gather Example: Matrix Addition
    int main(int argc, char *argv[])
    {
        int i, j, error = 0;
        double a[N][N], b[N][N], c[N][N];
        char *usage = "Usage: %s file\n";
        FILE *fd;

        // Make sure we have the correct number of arguments
        if (argc < 2) {
            fprintf (stderr, usage, argv[0]);
            error = -1;
        }

        // Make sure we can open the input file
        if (!error && (fd = fopen (argv[1], "r")) == NULL) {
            fprintf (stderr, "%s: Cannot open file %s for reading.\n",
                     argv[0], argv[1]);
            error = -1;
        }
The variable error is used to stop the other processors.

Scatter/Gather Example: Matrix Addition
        #pragma paraguin begin_parallel
        #pragma paraguin bcast error
        if (error) return error;
        #pragma paraguin end_parallel

        // Master prepares input
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                fscanf (fd, "%lf", &a[i][j]);
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                fscanf (fd, "%lf", &b[i][j]);
        fclose(fd);
The error code is broadcast to all processors so that they know to exit. If we just had a "return -1" in the two if statements above, only the master would exit and the workers would not, causing a deadlock.

Scatter/Gather Example: Matrix Addition
        #pragma paraguin begin_parallel

        // Scatter input
        #pragma paraguin scatter a b

        // Compute partial results.
        // Parallelize the following loop nest assigning iterations
        // of the outermost loop (i) to different partitions.
        #pragma paraguin forall
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                c[i][j] = a[i][j] + b[i][j];
            }
        }
Since this is a forall loop, each processor will compute a partition of the rows of the result.

Scatter/Gather Example: Matrix Addition
        ;
        // Gather partial results into the final result
        #pragma paraguin gather c
        #pragma paraguin end_parallel
The semicolon is there to prevent the gather pragma from being placed INSIDE the above for loop nest.
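The scatter, compute, gather template for matrix addition can be emulated serially as a rough sketch. It assumes each processor receives a contiguous, near-equal block of rows; the slides do not say how Paraguin actually assigns iterations. The names block_range, add_in_partitions, and DIM are hypothetical and not part of Paraguin.

```c
#include <assert.h>

#define DIM 4   // hypothetical matrix size for the sketch

// Hypothetical helper: half-open row range [*lo, *hi) owned by partition
// `rank` when `n` rows are split into `p` contiguous, near-equal blocks.
void block_range(int n, int p, int rank, int *lo, int *hi)
{
    assert(p > 0 && rank >= 0 && rank < p);
    int base = n / p;    // rows every partition receives
    int extra = n % p;   // the first `extra` partitions receive one more row
    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}

// Serial emulation: each "worker" adds only its own rows of a and b.
// Taken together, the workers fill every row of c exactly once.
void add_in_partitions(double a[DIM][DIM], double b[DIM][DIM],
                       double c[DIM][DIM], int p)
{
    for (int rank = 0; rank < p; rank++) {
        int lo, hi;
        block_range(DIM, p, rank, &lo, &hi);
        for (int i = lo; i < hi; i++)
            for (int j = 0; j < DIM; j++)
                c[i][j] = a[i][j] + b[i][j];
    }
}
```

Calling add_in_partitions(a, b, c, 3) gives the same c as a plain double loop, which is the property that lets the partial results be gathered without any communication during the compute step.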

More on Scatter/Gather
The scatter/gather pattern can also use broadcast, reduction, or both:
1. Master prepares input
2. Broadcast input
3. Compute partial results
4. Reduce partial results into the final result

Integration
To demonstrate Broadcast/Reduce, consider the problem of integrating a function using rectangles. As h approaches zero, the area of the rectangles approaches the area under the curve between a and b.
[Figure: the curve y = f(x) between a and b, with a rectangle of width h spanning x to x+h; f(x) and f(x+h) are marked on the curve.]
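As a worked form of the rectangle approximation, assuming N equal-width rectangles whose height is taken at the left edge (this matches x = a + i * h in the example code):

```latex
h = \frac{b - a}{N},
\qquad
\int_a^b f(x)\,dx \;\approx\; \sum_{i=0}^{N-1} f(a + i\,h)\, h
```

The sum has N independent terms, so they can be divided among processors and the partial sums added together at the end.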

Scatter/Gather Example: Integration
    // Let f(x) = 4 sin(1.5x) + 5
    double f(double x)
    {
        return 4.0 * sin(1.5*x) + 5;
    }

    int main(int argc, char *argv[])
    {
        char *usage = "Usage: %s a b N\n";
        int i, error = 0, N;
        double a, b, x, y, h, area, overall_area;

        // Make sure we have the correct number of arguments
        if (argc < 4) {
            fprintf (stderr, usage, argv[0]);
            error = -1;
        } else {
The variable error is used to stop the other processors.

Scatter/Gather Example: Integration
            // Master prepares input
            a = atof(argv[1]);
            b = atof(argv[2]);
            N = atoi(argv[3]);
            if (b <= a) {
                fprintf (stderr, "a should be smaller than b\n");
                error = -1;
            }
        }

        #pragma paraguin begin_parallel
        #pragma paraguin bcast error
        if (error) return error;
The error code is broadcast to all processors so that they know to exit.

Scatter/Gather Example: Integration
        ;
        // Broadcast input
        #pragma paraguin bcast a b N

        h = (b - a) / N;
        area = 0.0;

        // Compute partial results (N rectangles: i = 0 .. N-1)
        #pragma paraguin forall
        for (i = 0; i < N; i++) {
            x = a + i * h;
            y = f(x);
            area += y * h;
        }
The semicolon is there to prevent the bcast pragma from being placed INSIDE the above if statement. Since this is a forall loop, each processor will compute a partition of the rectangles.

Scatter/Gather Example: Integration
        ;
        // Reduce partial results into the final result
        #pragma paraguin reduce sum area overall_area
        #pragma paraguin end_parallel
The semicolon is there to prevent the reduce pragma from being placed INSIDE the above for loop. The final area is in overall_area.
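The broadcast, compute, reduce flow of the integration example can be emulated serially as a minimal sketch. It assumes each emulated worker takes a contiguous block of rectangles, which may differ from how Paraguin partitions a forall loop. The names partial_area, integrate, and square are hypothetical; square is only a stand-in integrand so the sketch needs no math library.

```c
#include <assert.h>

// Hypothetical sample integrand; any double -> double function can be passed.
double square(double x) { return x * x; }

// Area contributed by one emulated worker: the left-edge rectangles whose
// indices fall in that worker's contiguous block.
double partial_area(double (*f)(double), double a, double b,
                    int n, int p, int rank)
{
    assert(n > 0 && p > 0 && rank >= 0 && rank < p);
    double h = (b - a) / n;
    int lo = (int)((long long)rank * n / p);
    int hi = (int)((long long)(rank + 1) * n / p);
    double area = 0.0;
    for (int i = lo; i < hi; i++)
        area += f(a + i * h) * h;
    return area;
}

// Emulates the "reduce sum": partial areas from all workers are added up.
double integrate(double (*f)(double), double a, double b, int n, int p)
{
    double overall_area = 0.0;
    for (int rank = 0; rank < p; rank++)
        overall_area += partial_area(f, a, b, n, p, rank);
    return overall_area;
}
```

For example, integrate(square, 0.0, 3.0, 30000, 4) approximates the integral of x squared from 0 to 3, whose exact value is 9. The result does not depend on the number of workers beyond floating-point rounding.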

Stencil Pattern

Jacobi Iteration

Jacobi Iteration
    int main()
    {
        int i, j, time;
        double A[N][M], B[N][M];

        // A is initialized with data somehow

        for (time = 0; time < MAX_ITERATION; time++) {
            // Skip the boundary values
            for (i = 1; i < N-1; i++)
                for (j = 1; j < M-1; j++)
                    // Multiplying by 0.25 is faster than dividing by 4.0
                    B[i][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;

            // Skip the boundary values
            for (i = 1; i < N-1; i++)
                for (j = 1; j < M-1; j++)
                    A[i][j] = B[i][j];
        }
        ...
Newly computed values are placed in a new array, then copied back to the original.

Improved Jacobi Iteration
    int main()
    {
        int i, j, time, current, next;
        // Add another dimension of size 2 to A.
        // A[0] is old A and A[1] is old B.
        double A[2][N][M];

        // A[0] is initialized with data somehow and duplicated into A[1]

        current = 0;
        next = (current + 1) % 2;

        for (time = 0; time < MAX_ITERATION; time++) {
            for (i = 1; i < N-1; i++)
                for (j = 1; j < M-1; j++)
                    A[next][i][j] = (A[current][i-1][j] + A[current][i+1][j] +
                                     A[current][i][j-1] + A[current][i][j+1]) * 0.25;
            // Toggle between the two copies of the array
            current = next;
            next = (current + 1) % 2;
        }
        // Final result is in A[current]
        ...
We toggle between copies of the array. This avoids copying values back into the original array.
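A compact, self-contained version of the two-copy Jacobi iteration is sketched here so that the toggling can be tried on a small grid. The grid size (ROWS, COLS) and the function name jacobi are assumptions made for the sketch, not names from the slides.

```c
#include <assert.h>

#define ROWS 6   // hypothetical grid size for the sketch
#define COLS 6

// Runs `iterations` Jacobi sweeps over grid[2][ROWS][COLS] and returns the
// index (0 or 1) of the copy that holds the final result. Both copies must
// start with identical boundary values, since boundaries are never written.
int jacobi(double grid[2][ROWS][COLS], int iterations)
{
    assert(iterations >= 0);
    int current = 0, next = 1;
    for (int t = 0; t < iterations; t++) {
        for (int i = 1; i < ROWS - 1; i++)
            for (int j = 1; j < COLS - 1; j++)
                grid[next][i][j] = (grid[current][i-1][j] + grid[current][i+1][j] +
                                    grid[current][i][j-1] + grid[current][i][j+1]) * 0.25;
        // Swap roles: the copy just written becomes the input of the next sweep.
        current = next;
        next = 1 - current;
    }
    return current;
}
```

Because the roles swap once per sweep, the returned index equals iterations % 2 when the sweeps start from copy 0.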

Row versus Block Partitioning

Row versus Block Partitioning
With block partitioning, we will need to communicate data across both rows and columns. This will result in too much communication (too fine a granularity). With row partitioning, each processor only needs to communicate with at most 2 other processors.
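The "at most 2 other processors" claim for row partitioning can be made concrete with a small sketch. It assumes ranks 0 to p-1 hold consecutive blocks of rows from top to bottom; the function names and NO_NEIGHBOR are hypothetical.

```c
#include <assert.h>

#define NO_NEIGHBOR (-1)   // hypothetical marker for "no such processor"

// Rank holding the rows just above this partition, if any.
int upper_neighbor(int rank, int p)
{
    assert(p > 0 && rank >= 0 && rank < p);
    return rank > 0 ? rank - 1 : NO_NEIGHBOR;
}

// Rank holding the rows just below this partition, if any.
int lower_neighbor(int rank, int p)
{
    assert(p > 0 && rank >= 0 && rank < p);
    return rank < p - 1 ? rank + 1 : NO_NEIGHBOR;
}

// Number of processors this rank must exchange boundary rows with.
int neighbor_count(int rank, int p)
{
    return (upper_neighbor(rank, p) != NO_NEIGHBOR) +
           (lower_neighbor(rank, p) != NO_NEIGHBOR);
}
```

With 4 processors, ranks 0 and 3 each have one neighbor and ranks 1 and 2 each have two, so no processor ever talks to more than two others.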

Communication Pattern with Row Partitioning

Paraguin Stencil Pragma
A stencil pattern is done with a stencil pragma (all on one line):
    #pragma paraguin stencil <data> <#rows> <#cols> <max_iterations> <fname>
where
- <data> is a 3-dimensional array of size 2 x #rows x #cols
- <max_iterations> is the number of iterations of the time loop
- <fname> is the name of a function to perform each calculation

Paraguin Stencil Pragma
The function to perform each calculation should be declared as:
    <type> <fname> (<type> <data>[ ][ ], int i, int j)
where <type> is the base type of the array. The function should calculate and return the value at location <data>[i][j]. It should not modify that value, but simply return the new value.

Paraguin Stencil Pragma
    int __guin_current = 0;   // This is needed to access the last
                              // copy of the data

    // Function to compute each value
    double computeValue (double A[][M], int i, int j)
    {
        return (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;
    }

Paraguin Stencil Pragma
    int main()
    {
        int i, j, n, m, max_iterations;
        double A[2][N][M];   // A has a 3rd dimension of size 2

        // A[0] is initialized with data somehow and duplicated into A[1]

        #pragma paraguin begin_parallel

        // All pragma parameters must be literals or variables,
        // not preprocessor constants.
        n = N;
        m = M;
        max_iterations = TOTAL_TIME;

        #pragma paraguin stencil A n m max_iterations computeValue

        #pragma paraguin end_parallel

        // Final result is in A[__guin_current] or A[max_iterations % 2]
    }

The Stencil Pragma is Replaced with Code to do:
1. The 3-dimensional array given as an argument to the stencil pragma is broadcast to all available processors.
2. __guin_current is set to zero and __guin_next is set to one.
3. A loop is created that iterates max_iterations times. Within that loop, code is inserted to perform the following steps:

The Stencil Pragma is Replaced with Code to do:
- Each processor (except the last one) will send its last row to the processor with rank one more than its own rank.
- Each processor (except the first one) will receive the last row from the processor with rank one less than its own rank.
- Each processor (except the first one) will send its first row to the processor with rank one less than its own rank.
- Each processor (except the last one) will receive the first row from the processor with rank one more than its own rank.

The Stencil Pragma is Replaced with Code to do:
- Each processor will iterate through the values of the rows for which it is responsible and use the function provided to compute the next value.
- __guin_current and __guin_next toggle.
Once the loop finishes, the data is gathered back to the root processor (rank 0).

Stopping the Iterations Based Upon a Condition
The stencil pattern will execute a fixed number of iterations. What if we want to continue until the data converges to a solution? For example, we might stop when the maximum difference between the values in A[0] and A[1] is less than a tolerance, such as 0.0001. The problem with doing this in parallel is that it requires communication.

Why Communication is Needed to Test For a Termination Condition
There are 2 reasons inter-processor communication is needed to test for a termination condition:
1. The data is scattered across processors; and
2. The processors need to all agree whether to continue or terminate.
Parts of the data may converge faster than others, so some processors may decide to stop while others do not. Without agreement, there will be a deadlock.

Stopping the Iterations Based Upon a Condition
We could put the stencil pragma inside a loop:
1. Each processor computes the max difference between the old and new values.
2. We reduce this to a final max value.
3. We broadcast it back out to decide whether or not to continue.
Problem: The stencil pattern has a built-in Broadcast and Gather.
Solution: StencilLite
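The agreement step can be illustrated with a serial emulation, offered as a sketch rather than as what Paraguin generates. Each emulated worker computes the largest change in its own block of interior rows, the largest of those values is kept (the "reduce max"), and that single value decides for everyone whether to continue (the "broadcast"). The names local_max_diff, jacobi_until_converged, ROWS, and COLS are hypothetical.

```c
#include <assert.h>

#define ROWS 6   // hypothetical grid size for the sketch
#define COLS 6

static double abs_diff(double x, double y) { return x > y ? x - y : y - x; }

// Largest change over interior rows [lo, hi): what one emulated worker computes.
double local_max_diff(double old_grid[ROWS][COLS], double new_grid[ROWS][COLS],
                      int lo, int hi)
{
    double max_diff = 0.0;
    for (int i = lo; i < hi; i++)
        for (int j = 1; j < COLS - 1; j++) {
            double d = abs_diff(old_grid[i][j], new_grid[i][j]);
            if (d > max_diff) max_diff = d;
        }
    return max_diff;
}

// Sweeps until the global maximum change is <= tol. Returns the number of
// sweeps; *result receives the index of the copy holding the final values.
int jacobi_until_converged(double grid[2][ROWS][COLS], double tol, int p, int *result)
{
    assert(p > 0 && tol > 0.0);
    int current = 0, next = 1, sweeps = 0, done = 0;
    int interior = ROWS - 2;
    while (!done) {
        for (int i = 1; i < ROWS - 1; i++)
            for (int j = 1; j < COLS - 1; j++)
                grid[next][i][j] = (grid[current][i-1][j] + grid[current][i+1][j] +
                                    grid[current][i][j-1] + grid[current][i][j+1]) * 0.25;
        // "Reduce max": keep the largest of the per-worker maxima.
        double global_max = 0.0;
        for (int rank = 0; rank < p; rank++) {
            int lo = 1 + rank * interior / p;
            int hi = 1 + (rank + 1) * interior / p;
            double local = local_max_diff(grid[current], grid[next], lo, hi);
            if (local > global_max) global_max = local;
        }
        // "Broadcast": every worker uses the same global_max, so all agree.
        done = (global_max <= tol);
        current = next;
        next = 1 - current;
        sweeps++;
    }
    *result = current;
    return sweeps;
}
```

The number of sweeps is the same for any worker count p, because the decision depends only on the global maximum and never on a single worker's local view.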

Paraguin Stencil Pragma with Termination Condition
    int __guin_current = 0;   // This is needed to access the last copy
                              // of the data

    // Function to compute each value
    double computeValue (double A[][M], int i, int j)
    {
        return (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;
    }
This part is the same.

Paraguin Stencil Pragma with Termination Condition
    int main()
    {
        int i, j, n, m, max_iterations, done;
        double A[2][N][M], *aPtr, diff, max_diff, tol;

        // A[0] is initialized with data somehow and duplicated into A[1]

        #pragma paraguin begin_parallel

        // tol is used to determine if the termination condition is met.
        // When the changes in values are ALL less than tol, the values
        // have converged sufficiently.
        tol = 0.0001;

        n = N;
        m = M;
        max_iterations = TOTAL_TIME;

        #pragma paraguin bcast A
New variables: done, aPtr, diff, max_diff, and tol. The broadcast is now done by the user BEFORE the while loop begins.

Paraguin Stencil Pragma with Termination Condition
        done = 0;   // false

        // Need a logical-controlled loop
        while (!done) {
            ;   // This is to make sure the following pragma is inside the while
            #pragma paraguin stencilLite A n m max_iterations computeValue

            // Each processor determines the maximum change in values of
            // the partition for which it is responsible. The loop bounds need
            // to be 1 and n-1 to match the bounds of the stencil. Otherwise,
            // the partitioning will be incorrect.
            max_diff = 0.0;
            #pragma paraguin forall
            for (i = 1; i < n - 1; i++) {
                for (j = 1; j < m - 1; j++) {
                    diff = fabs(A[__guin_current][i][j] - A[__guin_next][i][j]);
                    if (diff > max_diff) max_diff = diff;
                }
            }
All processors determine the maximum absolute difference between the old values and the newly computed values.

Paraguin Stencil Pragma with Termination Condition
            ;   // This is needed to prevent the pragma from being located in the
                // above loop nest

            // Reduce the max_diff's from all processors
            #pragma paraguin reduce max max_diff diff

            // Broadcast the diff so that all processors will agree to continue
            // or terminate
            #pragma paraguin bcast diff

            // Termination condition if the maximum change in values is less
            // than the tolerance.
            if (diff <= tol) done = 1;   // true
        }
Reduce to find the maximum difference across all processors; the variable diff is being reused here. Broadcast so all processors are in agreement.

Paraguin Stencil Pragma with Termination Condition
        // The 1st and last rows of the array were not included in the
        // partitioning. This is why we use a pointer.
        aPtr = &A[__guin_current][1][0];
        n = (N - 2) * M * sizeof(double);
        #pragma paraguin gather aPtr( n )

        #pragma paraguin end_parallel

        // Final result is in A[__guin_current]
        // Cannot use max_iterations % 2
    }

Next Topic
Job Scheduler
Examples: Matrix Addition, Integration, Sobel Edge Detection

Questions?