Patterns Paraguin Compiler Version 2.1.

Patterns
As of right now, there are only two patterns implemented in Paraguin:
- Scatter/Gather (also known as master/slave)
- Stencil

Scatter/Gather
- Master prepares input
- Input is scattered to all processors
- Processors work independently (no communication)
- Partial results are gathered together to build the final result
[Figure: Scatter, nodes 1 to 7, Gather]

Scatter/Gather
This pattern is done as a template rather than a single pragma:
1. Master prepares input
2. Scatter input
3. Compute partial results
4. Gather partial results into the final result

Scatter/Gather Example Matrix Addition
    int main(int argc, char *argv[])
    {
        int i, j, error = 0;
        double a[N][N], b[N][N], c[N][N];
        char *usage = "Usage: %s file\n";
        FILE *fd;

        // Make sure we have the correct number of arguments
        if (argc < 2) {
            fprintf (stderr, usage, argv[0]);
            error = -1;
        }

        // Make sure we can open the input file
        if (!error && (fd = fopen (argv[1], "r")) == NULL) {
            fprintf (stderr, "%s: Cannot open file %s for reading.\n", argv[0], argv[1]);
            error = -1;
        }
The variable error is used to stop the other processors.

    #pragma paraguin begin_parallel
    #pragma paraguin bcast error
        if (error) return error;
    #pragma paraguin end_parallel

        // Master prepares input: read matrix a, then matrix b
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                fscanf (fd, "%lf", &a[i][j]);
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                fscanf (fd, "%lf", &b[i][j]);
        fclose(fd);
The error code is broadcast to all processors so that they know to exit. If we just had a “return -1” in the above two if statements, then only the master would exit and the workers would not, causing a deadlock.

    #pragma paraguin begin_parallel
    #pragma paraguin scatter a b

        // Parallelize the following loop nest assigning iterations
        // of the outermost loop (i) to different partitions.
    #pragma paraguin forall
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                c[i][j] = a[i][j] + b[i][j];
            }
        }
Scatter input, then compute partial results. Since this is a forall loop, each processor will compute a partition of the rows of the result.

        ;
    #pragma paraguin gather c
    #pragma paraguin end_parallel
Gather partial results into the final result. The semicolon is here to prevent the gather pragma from being placed INSIDE the above for loop nest.
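To make the row partitioning concrete, here is a minimal serial sketch of the scatter / compute / gather steps for matrix addition. It is plain C with no Paraguin or MPI calls. The names `row_range` and `simulated_matrix_add`, the matrix size `N = 8`, and the even block split of rows are all assumptions made for illustration; the partitioning Paraguin actually generates may differ.

```c
#define N 8   /* hypothetical matrix size for this sketch */

/* Compute the half-open row range [*lo, *hi) owned by `rank` when the
 * N rows are split as evenly as possible among `nprocs` processors.
 * Assumption: a simple block partition, not necessarily Paraguin's. */
void row_range(int rank, int nprocs, int *lo, int *hi)
{
    int base = N / nprocs;    /* rows every processor gets */
    int extra = N % nprocs;   /* the first `extra` processors get one more */
    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}

/* Serial stand-in for scatter / compute / gather: each simulated
 * processor adds only the rows in its own partition, and the shared
 * array c plays the role of the gathered result. */
void simulated_matrix_add(double a[N][N], double b[N][N], double c[N][N],
                          int nprocs)
{
    for (int rank = 0; rank < nprocs; rank++) {
        int lo, hi;
        row_range(rank, nprocs, &lo, &hi);
        for (int i = lo; i < hi; i++)
            for (int j = 0; j < N; j++)
                c[i][j] = a[i][j] + b[i][j];
    }
}
```

With 3 simulated processors and 8 rows, this split gives row ranges 0-2, 3-5, and 6-7, so every row is computed exactly once.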

More on Scatter/Gather
The scatter/gather pattern can also use either broadcast or reduction or both:
1. Master prepares input
2. Broadcast input
3. Compute partial results
4. Reduce partial results into the final result

Integration
To demonstrate Broadcast/Reduce, consider the problem of integrating a function using rectangles. As h approaches zero, the area of the rectangles approaches the area under the curve between a and b.
[Figure: the curve y = f(x) between a and b, with one rectangle of width h spanning x to x+h and the heights f(x) and f(x+h) marked]
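The rectangle approximation can be written out as a sum. This is a sketch that assumes N rectangles of equal width h, each taking its height from the left endpoint; the slide's figure suggests this but does not state it.

```latex
h = \frac{b - a}{N}, \qquad
\text{area} \approx \sum_{i=0}^{N-1} f(a + i\,h)\, h, \qquad
\lim_{N \to \infty} \sum_{i=0}^{N-1} f(a + i\,h)\, h = \int_a^b f(x)\, dx
```

Each term of the sum is independent of the others, which is why the rectangles can be divided among processors and the partial sums combined with a reduction.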

Scatter/Gather Example Integration
    // Let f(x) = 4sin(1.5x) + 5
    double f(double x)
    {
        return 4.0 * sin(1.5*x) + 5;
    }

    int main(int argc, char *argv[])
    {
        char *usage = "Usage: %s a b N\n";
        int i, error = 0, N;
        double a, b, x, y, h, area, overall_area;

        // Make sure we have the correct number of arguments
        if (argc < 4) {
            fprintf (stderr, usage, argv[0]);
            error = -1;
        } else {
The variable error is used to stop the other processors.

            // Master prepares input
            a = atof(argv[1]);
            b = atof(argv[2]);
            N = atoi(argv[3]);
            if (b <= a) {
                fprintf (stderr, "a should be smaller than b\n");
                error = -1;
            }
        }

    #pragma paraguin begin_parallel
    #pragma paraguin bcast error
        if (error) return error;
The error code is broadcast to all processors so that they know to exit.

        ;
    #pragma paraguin bcast a b N

        h = (b - a) / N;
        area = 0.0;

    #pragma paraguin forall
        for (i = 0; i < N; i++) {
            x = a + i * h;
            y = f(x);
            area += y * h;
        }
Broadcast input. The semicolon is here to prevent the bcast pragma from being placed INSIDE the above if statement.
Compute partial results. Since this is a forall loop, each processor will compute a partition of the rectangles.

        ;
    #pragma paraguin reduce sum area overall_area
    #pragma paraguin end_parallel
Reduce partial results into the final result. The semicolon is here to prevent the reduce pragma from being placed INSIDE the above for loop nest. The final area is in overall_area.
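The broadcast / compute / reduce flow can be imitated in plain serial C to see that the partial sums add up to the same total no matter how many processors share the work. This is only a sketch: the integrand `g(x) = 3x^2` is a hypothetical stand-in chosen because its exact integral is easy to check (it is not the f(x) from the slides), and `partial_area`, `simulated_integrate`, and the block split of the rectangles are assumptions.

```c
/* Hypothetical integrand with a known antiderivative (x^3), chosen so
 * the result is easy to verify. Not the f(x) used in the slides. */
static double g(double x)
{
    return 3.0 * x * x;
}

/* Partial area computed by one simulated processor. The values a, b and
 * n are what would have been broadcast; each rank handles one
 * contiguous block of the n rectangles (an assumed partitioning). */
double partial_area(double a, double b, int n, int rank, int nprocs)
{
    double h = (b - a) / n;
    int lo = (int)((long)rank * n / nprocs);
    int hi = (int)((long)(rank + 1) * n / nprocs);
    double area = 0.0;

    for (int i = lo; i < hi; i++)
        area += g(a + i * h) * h;
    return area;
}

/* Serial stand-in for "reduce sum": add up every rank's partial area. */
double simulated_integrate(double a, double b, int n, int nprocs)
{
    double overall_area = 0.0;

    for (int rank = 0; rank < nprocs; rank++)
        overall_area += partial_area(a, b, n, rank, nprocs);
    return overall_area;
}
```

Calling `simulated_integrate(0.0, 2.0, 100000, 4)` should land close to the exact value 8, and changing the processor count should only change the result by floating-point rounding.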

Stencil Pattern

Jacobi Iteration
    int main()
    {
        int i, j, time;
        double A[N][M], B[N][M];

        // A is initialized with data somehow

        for (time = 0; time < MAX_ITERATION; time++) {
            // Skip the boundary values
            for (i = 1; i < N-1; i++)
                for (j = 1; j < M-1; j++)
                    // Multiplying by 0.25 is faster than dividing by 4.0
                    B[i][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;

            // Copy the new values back once the whole sweep is done
            for (i = 1; i < N-1; i++)
                for (j = 1; j < M-1; j++)
                    A[i][j] = B[i][j];
        }
        ...
Newly computed values are placed in a new array, then copied back to the original.

Improved Jacobi Iteration
    int main()
    {
        int i, j, time, current, next;
        double A[2][N][M];   // Add another dimension of size 2 to A

        // A[0] is initialized with data somehow and duplicated into A[1]

        current = 0;
        next = (current + 1) % 2;

        for (time = 0; time < MAX_ITERATION; time++) {
            for (i = 1; i < N-1; i++)
                for (j = 1; j < M-1; j++)
                    A[next][i][j] = (A[current][i-1][j] + A[current][i+1][j] +
                                     A[current][i][j-1] + A[current][i][j+1]) * 0.25;

            // We toggle between copies of the array
            current = next;
            next = (current + 1) % 2;
        }

        // Final result is in A[current]
        ...
A[0] is the old A and A[1] is the old B. Toggling between the two copies avoids copying values back into the original array.
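The toggling scheme can be tried on a tiny grid. The following is a minimal sketch under stated assumptions: the 5 x 5 grid size and the function name `jacobi` are hypothetical, and the caller is expected to initialize both copies with the same boundary values, as the slide describes.

```c
#define N 5   /* hypothetical grid size for this sketch */
#define M 5

/* Run `iterations` Jacobi sweeps over the interior of a grid stored in
 * two copies, A[0] and A[1]. Both copies must start with identical
 * boundary values. Returns the index of the copy holding the result. */
int jacobi(double A[2][N][M], int iterations)
{
    int current = 0, next = 1;

    for (int t = 0; t < iterations; t++) {
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < M - 1; j++)
                A[next][i][j] = (A[current][i-1][j] + A[current][i+1][j] +
                                 A[current][i][j-1] + A[current][i][j+1]) * 0.25;

        /* Toggle: the copy just written becomes the one read next. */
        current = next;
        next = 1 - current;
    }
    return current;
}
```

With the boundary held at 1.0 and the interior starting at 0.0, one sweep gives a corner interior cell the value 0.5 (two of its four neighbours are boundary cells), and repeated sweeps drive the whole interior toward 1.0.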

Row versus Block Partitioning
- With block partitioning, we will need to communicate data across both rows and columns
- This will result in too much communication (too fine a granularity)
- With row partitioning, each processor only needs to communicate with at most 2 other processors

Communication Pattern with Row Partitioning

Paraguin Stencil Pragma
A stencil pattern is done with a stencil pragma, written all on one line:
    #pragma paraguin stencil <data> <#rows> <#cols> <max_iterations> <fname>
Where:
- <data> is a 3-dimensional array of size 2 x #rows x #cols
- <max_iterations> is the number of iterations of the time loop
- <fname> is the name of a function to perform each calculation

The function to perform each calculation should be declared as:
    <type> <fname> (<type> <data>[ ][ ], int i, int j)
Where <type> is the base type of the array.
The function should calculate and return the value at location <data>[i][j]. It should not modify that value, but simply return the new value.

    int __guin_current = 0;   // This is needed to access the last
                              // copy of the data

    // Function to compute each value
    double computeValue (double A[][M], int i, int j)
    {
        return (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;
    }

    int main()
    {
        int i, j, n, m, max_iterations;
        double A[2][N][M];   // A has a 3rd dimension of size 2

        // A[0] is initialized with data somehow and duplicated into A[1]

    #pragma paraguin begin_parallel
        n = N;
        m = M;
        max_iterations = TOTAL_TIME;
    #pragma paraguin stencil A n m max_iterations computeValue
    #pragma paraguin end_parallel

        // Final result is in A[__guin_current] or A[max_iterations % 2]
    }
All pragma parameters must be literals or variables. No preprocessor constants.

The Stencil Pragma is Replaced with Code to do:
The 3-dimensional array given as an argument to the stencil pragma is broadcast to all available processors.
__guin_current is set to zero and __guin_next is set to one.
A loop is created to iterate max_iterations times. Within that loop, code is inserted to perform the following steps:
1. Each processor (except the last one) sends its last row to the processor with rank one more than its own rank.
2. Each processor (except the first one) receives the last row from the processor with rank one less than its own rank.
3. Each processor (except the first one) sends its first row to the processor with rank one less than its own rank.
4. Each processor (except the last one) receives the first row from the processor with rank one more than its own rank.
5. Each processor iterates through the values of the rows for which it is responsible and uses the function provided to compute the next value.
6. __guin_current and __guin_next toggle.
The data is gathered back to the root processor (rank 0).
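The row exchange in the steps above can be mimicked in a single process to check that it reproduces a serial Jacobi sweep. What follows is a sketch under several assumptions, not the code Paraguin generates: the sizes `N`, `M`, `P`, the `Local` type, and the function names are hypothetical; each simulated processor keeps only its own rows plus two ghost rows (rather than receiving the whole array by broadcast); and `memcpy` stands in for the send/receive pairs.

```c
#include <string.h>

#define N 8            /* global rows (hypothetical) */
#define M 6            /* columns (hypothetical) */
#define P 4            /* simulated processors; N must be divisible by P */
#define ROWS (N / P)   /* rows owned by each processor */

/* Per-processor storage: two copies (current/next) of the owned rows,
 * plus a ghost row above (index 0) and one below (index ROWS + 1). */
typedef double Local[2][ROWS + 2][M];

static double compute_value(double A[][M], int i, int j)
{
    return (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;
}

/* Stand-in for the four send/receive steps: copy edge rows into the
 * neighbouring ranks' ghost rows. */
static void exchange_ghost_rows(Local loc[P], int cur)
{
    for (int r = 0; r < P; r++) {
        if (r < P - 1)   /* last row goes to rank r + 1 */
            memcpy(loc[r + 1][cur][0], loc[r][cur][ROWS], sizeof(double) * M);
        if (r > 0)       /* first row goes to rank r - 1 */
            memcpy(loc[r - 1][cur][ROWS + 1], loc[r][cur][1], sizeof(double) * M);
    }
}

/* Run `iterations` stencil steps on `grid` using P simulated ranks. */
void simulated_stencil(double grid[N][M], int iterations)
{
    static Local loc[P];
    int cur = 0, next = 1;

    /* Distribute: each rank copies its rows into both of its copies. */
    for (int r = 0; r < P; r++)
        for (int k = 0; k < 2; k++)
            memcpy(loc[r][k][1], grid[r * ROWS], sizeof(double) * ROWS * M);

    for (int t = 0; t < iterations; t++) {
        exchange_ghost_rows(loc, cur);
        for (int r = 0; r < P; r++)
            for (int i = 1; i <= ROWS; i++) {
                int g = r * ROWS + i - 1;             /* global row index */
                if (g == 0 || g == N - 1) continue;   /* skip boundary rows */
                for (int j = 1; j < M - 1; j++)
                    loc[r][next][i][j] = compute_value(loc[r][cur], i, j);
            }
        cur = next;        /* toggle current and next */
        next = 1 - cur;
    }

    /* Gather: copy every rank's rows back into the global grid. */
    for (int r = 0; r < P; r++)
        memcpy(grid[r * ROWS], loc[r][cur][1], sizeof(double) * ROWS * M);
}
```

Because each rank only ever reads its own rows and its two ghost rows, the sketch also shows why row partitioning needs communication with at most two neighbours per iteration.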

Stopping the Iterations Based Upon a Condition
The stencil pattern will execute a fixed number of iterations.
What if we want to continue until the data converges to a solution? For example, until the maximum difference between the values in A[0] and A[1] is less than a tolerance.
The problem with doing this in parallel is that it requires communication.

Why Communication is Needed to Test For a Termination Condition
There are 2 reasons inter-processor communication is needed to test for a termination condition:
1. The data is scattered across processors; and
2. The processors all need to agree whether to continue or terminate.
   - Parts of the data may converge faster than others
   - Some processors may decide to stop while others do not
   - Without agreement, there will be a deadlock
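The agreement problem is easy to see with a few numbers. The sketch below is plain serial C; the function names and the sample values in the usage note are hypothetical, and `reduce_max` is only a stand-in for a max-reduction followed by a broadcast of the result.

```c
/* Decision one processor would make from its own partition alone:
 * 1 = stop, 0 = continue. */
int local_decision(double local_max_diff, double tol)
{
    return local_max_diff <= tol;
}

/* Serial stand-in for "reduce max": combine every processor's local
 * maximum difference into one global maximum. */
double reduce_max(const double local_max_diff[], int nprocs)
{
    double global = local_max_diff[0];

    for (int r = 1; r < nprocs; r++)
        if (local_max_diff[r] > global)
            global = local_max_diff[r];
    return global;
}

/* Decision every processor makes once the global maximum has been
 * shared with all of them, so they all reach the same answer. */
int agreed_decision(const double local_max_diff[], int nprocs, double tol)
{
    return reduce_max(local_max_diff, nprocs) <= tol;
}
```

For example, with local maximum differences of 0.001, 0.2, 0.0005 and 0.03 and an illustrative tolerance of 0.01, two processors would stop on their own and two would continue. After the reduction, all four see the global maximum 0.2 and all continue.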

Stopping the Iterations Based Upon a Condition
We could put the stencil pragma inside a loop:
- Each processor computes the max difference between the old and new values
- We reduce this to a final max value
- Broadcast it back out to decide whether or not to continue
Problem: The stencil pattern has a built-in Broadcast and Gather.
Solution: StencilLite

Paraguin Stencil Pragma with Termination Condition
    int __guin_current = 0;   // This is needed to access the last copy
                              // of the data

    // Function to compute each value
    double computeValue (double A[][M], int i, int j)
    {
        return (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;
    }
This part is the same.

    int main()
    {
        int i, j, n, m, max_iterations, done;
        double A[2][N][M], *aPtr, diff, max_diff, tol;

        // A[0] is initialized with data somehow and duplicated into A[1]

    #pragma paraguin begin_parallel

        // tol is used to determine if the termination condition is met
        // When the changes in values are ALL less than tol, the values
        // have converged sufficiently.
        tol = ;
        n = N;
        m = M;
        max_iterations = TOTAL_TIME;

    #pragma paraguin bcast A
New variables: done, aPtr, diff, max_diff, and tol.
The broadcast is now done by the user BEFORE the while loop begins.

        done = 0;   // false
        while (!done) {
            ;   // This is to make sure the following pragma is inside the while
    #pragma paraguin stencilLite A n m max_iterations computeValue

            // Each processor determines the maximum change in values of
            // the partition for which it is responsible. The loop bounds need
            // to be 1 and n-1 to match the bounds of the stencil. Otherwise,
            // the partitioning will be incorrect.
            max_diff = 0.0;
    #pragma paraguin forall
            for (i = 1; i < n - 1; i++) {
                for (j = 1; j < m - 1; j++) {
                    diff = fabs(A[__guin_current][i][j] - A[__guin_next][i][j]);
                    if (diff > max_diff) max_diff = diff;
                }
            }
A logical-controlled loop is needed.
All processors determine the maximum absolute difference between the old values and the newly computed values.

            ;   // This is needed to prevent the pragma from being located in the
                // above loop nest

            // Reduce the max_diff's from all processors
    #pragma paraguin reduce max max_diff diff

            // Broadcast the diff so that all processors will agree to continue
            // or terminate
    #pragma paraguin bcast diff

            // Termination condition if the maximum change in values is less
            // than the tolerance.
            if (diff <= tol) done = 1;   // true
        }
Reduce to find the maximum difference across all processors. The variable diff is being reused here.
Broadcast so all processors are in agreement.

        aPtr = &A[__guin_current][1][0];
        n = (N - 2) * M * sizeof(double);
    #pragma paraguin gather aPtr( n )
    #pragma paraguin end_parallel

        // Final result is in A[__guin_current]
        // Cannot use max_iterations % 2
    }
The 1st and last rows of the array were not included in the partitioning. This is why we use a pointer.

Next Topic
- Job Scheduler
- Examples: Matrix Addition, Integration, Sobel Edge Detection

Questions?
