1 Using compiler-directed approach to create MPI code automatically
Paraguin Compiler Patterns
ITCS4145/5145, Parallel Programming
Clayton Ferner/B. Wilkinson
March 11, ParagionSlides2abw.ppt

2 The Paraguin compiler is being developed by Dr. C. Ferner, UNC-Wilmington
The following is based upon his slides.

3 Patterns
As of right now, only two patterns are implemented in Paraguin: Scatter/Gather and Stencil.

4 Scatter/Gather
Master prepares input. Input is scattered to all processors. Processors work independently (no communication). Partial results are gathered together to build the final result.
[Figure: the master's input is scattered across the processors, and the partial results are gathered back]

5 Scatter/Gather
This pattern is done as a template rather than a single pragma:
1. Master prepares input
2. Scatter input
3. Compute partial results
4. Gather partial results into the final result

6 Scatter/Gather Example - Matrix Addition

int main(int argc, char *argv[]) {
  int i, j, error = 0;
  double a[N][N], b[N][N], c[N][N];
  char *usage = "Usage: %s file\n";
  FILE *fd;

  if (argc < 2) {                 // Make sure we have the correct number of arguments
    fprintf (stderr, usage, argv[0]);
    error = -1;
  }
  if (!error && (fd = fopen (argv[1], "r")) == NULL) {   // Make sure we can open the input file
    fprintf (stderr, "%s: Cannot open file %s for reading.\n", argv[0], argv[1]);
    error = -1;
  }

  #pragma paraguin begin_parallel
  #pragma paraguin bcast error
  if (error) return error;
  #pragma paraguin end_parallel

The variable error is used to stop the other processors: the error code is broadcast to all processors so that they know to exit. If we just had a "return -1" in the above two if statements, then only the master would exit and the workers would not, causing a deadlock.

7 Compute partial results

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      fscanf (fd, "%lf", &a[i][j]);
      fscanf (fd, "%lf", &b[i][j]);
    }
  fclose(fd);                      // Master prepares input

  #pragma paraguin begin_parallel
  #pragma paraguin scatter a b     // Scatter input

  // Parallelize loop nest assigning iterations
  // of outermost loop (i) to different partitions.
  #pragma paraguin forall
  for (i = 0; i < N; i++) {        // Compute partial results
    for (j = 0; j < N; j++) {
      c[i][j] = a[i][j] + b[i][j];
    }
  }
  ;                                // Semicolon to prevent the gather pragma from being placed INSIDE the above loop nest
  #pragma paraguin gather c        // Gather partial results into the final result
  #pragma paraguin end_parallel

8 More on Scatter/Gather
The scatter/gather pattern can also use broadcast, reduction, or both:
1. Master prepares input
2. Broadcast input
3. Compute partial results
4. Reduce partial results into the final result

9 Broadcast/Reduce Example: Integration
To demonstrate Broadcast/Reduce, consider the problem of integrating a function using rectangles: as h approaches zero, the area of the rectangles approaches the area under the curve between a and b.
[Figure: a rectangle of width h and height f(x) under the curve y=f(x), between x and x+h, over the interval from a to b]

10 Master prepares input
Let f(x) = 4sin(1.5x) + 5

double f(double x) {
  return 4.0 * sin(1.5*x) + 5;
}

int main(int argc, char *argv[]) {
  char *usage = "Usage: %s a b N\n";
  int i, error = 0, N;
  double a, b, x, y, h, area, overall_area;

  if (argc < 4) {                  // Make sure we have the correct number of arguments
    fprintf (stderr, usage, argv[0]);
    error = -1;
  } else {
    a = atof(argv[1]);             // Master prepares input
    b = atof(argv[2]);
    N = atoi(argv[3]);
    if (b <= a) {
      fprintf (stderr, "a should be smaller than b\n");
      error = -1;
    }
  }

  #pragma paraguin begin_parallel
  #pragma paraguin bcast error
  if (error) return error;

f(x) needs to run in parallel. In previous versions of Paraguin, functions needed their own
// #pragma paraguin begin_parallel
// #pragma paraguin end_parallel
The variable error is used to stop the other processors: the error code is broadcast to all processors so that they know to exit.

11 Compute partial results

  ;                                // Semicolon to prevent the bcast pragma from being placed INSIDE the above if statement
  #pragma paraguin bcast a b N     // Broadcast input
  h = (b - a) / N;
  area = 0.0;

  #pragma paraguin forall          // Compute partial results
  for (i = 0; i < N-1; i++) {
    x = a + i * h;
    y = f(x);
    area += y * h;
  }
  ;                                // Semicolon to prevent the reduce pragma from being placed INSIDE the above loop nest
  #pragma paraguin reduce sum area overall_area   // Reduce partial results into the final result
  #pragma paraguin end_parallel

Since this is a forall loop, each processor will compute a partition of the rectangles. The final area is in overall_area.

12 Stencil Pattern

13 Jacobi Iteration

14 Basic Jacobi Iteration

int main() {
  int i, j, time;
  double A[N][M], B[N][M];
  // A is initialized with data somehow

  for (time = 0; time < MAX_ITERATION; time++) {
    for (i = 1; i < N-1; i++)        // Skip the boundary values
      for (j = 1; j < M-1; j++)
        // Multiplying by 0.25 is faster than dividing by 4.0
        B[i][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;

    // Newly computed values are placed in a new array,
    // then copied back to the original.
    for (i = 1; i < N-1; i++)
      for (j = 1; j < M-1; j++)
        A[i][j] = B[i][j];
  }
  ...

15 Improved Jacobi Iteration used in Paraguin

int main() {
  int i, j, time, current, next;
  // Add another dimension of size 2 to A: A[0] is the old A and A[1] is the old B
  double A[2][N][M];
  // A[0] is initialized with data somehow and duplicated into A[1]

  current = 0;
  next = (current + 1) % 2;
  for (time = 0; time < MAX_ITERATION; time++) {
    for (i = 1; i < N-1; i++)
      for (j = 1; j < M-1; j++)
        A[next][i][j] = (A[current][i-1][j] + A[current][i+1][j] +
                         A[current][i][j-1] + A[current][i][j+1]) * 0.25;
    // We toggle between the two copies of the array;
    // this avoids copying values back into the original array.
    current = next;
    next = (current + 1) % 2;
  }
  // Final result is in A[current]
  ...

16 Partitioning: Row versus Block Partitioning
With block partitioning, we would need to communicate data across both rows and columns. This would result in too much communication (too fine a granularity). With row partitioning, each processor only needs to communicate with at most 2 other processors.

17 Communication Pattern with Row Partitioning

18 Paraguin Stencil Pragma
A stencil pattern is done with a stencil pragma: #pragma paraguin stencil <data> <#rows> <#cols> <max_iterations> <fname> where <data> is a 3 dimensional array 2 x # rows x # cols <max_iterations> is the number of iterations of the time loop <fname> is the name of a function to perform each calculation

19 Paraguin Stencil Pragma Function <fname>
The function to perform each calculation should be declared as:

<type> <fname> (<type> <data>[ ][ ], int i, int j)

where <type> is the base type of the array and i, j is the location in the array to be computed. The function should calculate and return the value at location <data>[i][j]. It should not modify that value, but simply return the new value.

20 Paraguin Stencil Program

int __guin_current = 0; // This is needed to access the last copy of the data

// Function to compute each value
double computeValue (double A[][M], int i, int j) {
  return (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;
}

int main() {
  int i, j, n, m, max_iterations;
  double A[2][N][M];   // A has a 3rd dimension of size 2
  // A[0] is initialized with data somehow and duplicated into A[1]

  #pragma paraguin begin_parallel
  // All pragma parameters must be literals or variables; no preprocessor constants.
  n = N;
  m = M;
  max_iterations = TOTAL_TIME;
  #pragma paraguin stencil A n m max_iterations computeValue
  #pragma paraguin end_parallel

  // Final result is in A[__guin_current] or A[max_iterations % 2]

In previous versions of Paraguin, functions needed their own
// #pragma paraguin begin_parallel
// #pragma paraguin end_parallel

21 The Stencil Pragma is Replaced with Code to do:
The 3-dimensional array given as an argument to the stencil pragma is broadcast to all available processors. __guin_current is set to zero and __guin_next is set to one. A loop is created to iterate max_iterations times. Within that loop, code is inserted to perform the following steps:

1. Each processor (except the last one) sends its last row to the processor with rank one more than its own rank.
2. Each processor (except the first one) receives the last row from the processor with rank one less than its own rank.
3. Each processor (except the first one) sends its first row to the processor with rank one less than its own rank.
4. Each processor (except the last one) receives the first row from the processor with rank one more than its own rank.
5. Each processor iterates through the values of the rows for which it is responsible and uses the function provided to compute the next value.
6. __guin_current and __guin_next toggle.
7. The data is gathered back to the root processor (rank 0).

23 Stopping the Iterations Based Upon a Condition
The stencil pattern will execute a fixed number of iterations. What if we want to continue until the data converges to a solution? For example, until the maximum difference between the values in A[0] and A[1] is less than some tolerance. The problem with doing this in parallel is that it requires communication.

24 Why Communication is Needed to Test For a Termination Condition
There are 2 reasons inter-processor communication is needed the test for a termination condition: The data is scattered across processors; and The processors need to all agree whether to continue or terminate. Parts of the data may converge faster than others Some processors may decide to stop and others do not Without agreement, there will be a deadlock

25 Paraguin Stencil Pragma with Termination Condition
int __guin_current = 0; // This is needed to access the last copy of the data // Function to compute each value double computeValue (double A[][M], int i, int j) { return (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25; } int main() { int i, j, n, m, max_iterations, done; double A[2][N][M], diff, max_diff, tol; // A[0] is initialized with data somehow and duplicated into A[1] #pragma paraguin begin_parallel tol = ; n = N; m = M; max_iterations = TOTAL_TIME; This part is the same. New variables tol used to determine if termination condition met. When change in values are ALL less than tol, values have converged sufficiently. Initializations are within the parallel region

26 Paraguin Stencil Pragma with Termination Condition
done = 0; // false while (!done) { ; #pragma paraguin stencil A n m max_iterations computeValue max_diff = 0.0; #pragma paraguin forall for (i = 1; i < n - 1; i++) { for (j = 1; j < n - 1; j++) { diff = fabs(A[__guin_current][i][j] - A[__guin_next][i][j]); if (diff > max_diff) max_diff = diff; } Need a logical-controlled loop To make sure following pragma is inside the while Each processor determines max change in values of its partition. Loop bounds need to be 1 and n-1 to match bounds of stencil. Otherwise, partitioning will be incorrect. All processors determine the maximum absolute difference between the old values and the newly computed values.

27 Paraguin Stencil Pragma with Termination Condition
Needed to prevent pragma from being located in above loop nest Reduce to find the maximum difference across all processors. ; // Reduce the max_diff's from all processors #pragma paraguin reduce max max_diff diff #pragma paraguin bcast diff if (diff <= tol) done = 1; // true } #pragma paraguin end_parallel // Final result is in A[__guin_current]. Cannot use max_iterations % 2 The variable diff is being reused here. Broadcast diff so that all processes will agree to continue or terminate Termination condition if max change in values is less than tolerance.

28 Questions?

