Paraguin Compiler Examples
Examples
- Matrix Addition (the complete program)
- Traveling Salesman Problem (TSP)
- Sobel Edge Detection
Matrix Addition The complete program
Matrix Addition (complete)

#define N 512

#ifdef PARAGUIN
typedef void* __builtin_va_list;
extern int MPI_COMM_WORLD;
extern int MPI_Barrier();
#endif

#include <stdio.h>
#include <math.h>
#include <sys/time.h>

print_results(char *prompt, float a[N][N]);

int main(int argc, char *argv[])
{
    int i, j;
    float a[N][N], b[N][N], c[N][N];
    char *usage = "Usage: %s file\n";
    FILE *fd;
    double elapsed_time;
    struct timeval tv1, tv2;

    if (argc < 2) {
        fprintf (stderr, usage, argv[0]);
        return -1;
    }

    if ((fd = fopen (argv[1], "r")) == NULL) {
        fprintf (stderr, "%s: Cannot open file %s for reading.\n",
                 argv[0], argv[1]);
        return -1;
    }
    // Read input from file for matrices a and b.
    // The I/O is not timed because this I/O needs
    // to be done regardless of whether this program
    // is run sequentially on one processor or in
    // parallel on many processors. Therefore, it is
    // irrelevant when considering speedup.
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            fscanf (fd, "%f", &a[i][j]);
        }
    }
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            fscanf (fd, "%f", &b[i][j]);
        }
    }
#ifdef PARAGUIN
    ;
    #pragma paraguin begin_parallel

    // This barrier is here so that we can take a time stamp
    // once we know all processes are ready to go.
    MPI_Barrier(MPI_COMM_WORLD);

    #pragma paraguin end_parallel
#endif

    // Take a time stamp
    gettimeofday(&tv1, NULL);

    // Broadcast the input to all processors. This could be
    // faster if we used scatter, but Bcast is easy and scatter
    // is not implemented in Paraguin
    #pragma paraguin bcast a b
    // Parallelize the following loop nest assigning iterations
    // of the outermost loop (i) to different partitions.
    #pragma paraguin forall C   p   i   j   \
                            0x0 -1  1   0x0 \
                            0x0 1   -1  0x0

    // We need to gather all values c[i][j]. So we can just
    // use i,j => 0.
    #pragma paraguin gather 0x0 C   i   j   \
                                0x0 0x0 0x0

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            c[i][j] = a[i][j] + b[i][j];
        }
    }
    ;
    #pragma paraguin end_parallel

    // Take a time stamp. This won't happen until after the master
    // process has gathered all the input from the other processes.
    gettimeofday(&tv2, NULL);

    elapsed_time = (tv2.tv_sec - tv1.tv_sec) +
                   ((tv2.tv_usec - tv1.tv_usec) / 1000000.0);

    printf ("elapsed_time=\t%lf (seconds)\n", elapsed_time);

    // print result
    print_results("C = ", c);
}
print_results(char *prompt, float a[N][N])
{
    int i, j;

    printf ("\n\n%s\n", prompt);
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            printf(" %.2f", a[i][j]);
        }
        printf ("\n");
    }
    printf ("\n\n");
}
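The elapsed-time arithmetic in the program can be isolated into a small helper. This is only a sketch: the name `elapsed_seconds` is hypothetical and does not appear in the original program. It shows why the microsecond difference may be negative and still give the right answer once it is added to the difference in seconds.

```c
#include <sys/time.h>

/* Seconds between two gettimeofday() stamps, as a double.
 * Hypothetical helper; the original program computes this inline. */
static double elapsed_seconds(struct timeval start, struct timeval end)
{
    /* The tv_usec difference may be negative; adding it to the
     * whole-second difference still yields the correct total. */
    return (end.tv_sec - start.tv_sec)
         + (end.tv_usec - start.tv_usec) / 1000000.0;
}
```

For example, stamps of 10.9 s and 12.1 s give 2 s plus -0.8 s, or about 1.2 s.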
Matrix Addition
After compiling with the command:
    runparaguin matrixadd.c
This produces:
    matrixadd.out.c (source with MPI)
    matrixadd.out (compiled with mpicc)
(Demonstration)
Partitioning Reviewed

#pragma paraguin forall C   p   i   j   \
                        0x0 -1  1   0x0 \
                        0x0 1   -1  0x0

The expression above assigns each iteration of the i loop to its own partition (p = i). We could also partition along the j loop:

#pragma paraguin forall C   p   i   j   \
                        0x0 -1  0x0 1   \
                        0x0 1   0x0 -1

Or we could have many other partitions.
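The partitioning is a system of linear inequalities written in matrix/vector form: a coefficient matrix multiplied by a vector of variables, with every row compared against zero in the same direction. The pragma header names the columns (C for the constant term, then p, i, and j), and each continuation line is one row of the matrix.

So the partition expressed in the pragma:

#pragma paraguin forall C   p   i   j   \
                        0x0 -1  1   0x0 \
                        0x0 1   -1  0x0

represents the following matrix/vector product:

$$\begin{bmatrix} 0 & -1 & 1 & 0 \\ 0 & 1 & -1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ p \\ i \\ j \end{bmatrix}$$

If we multiply this out, we get one expression per row:

$$-p + i \qquad \text{and} \qquad p - i$$

Now simplify: the second expression is the negation of the first, and both are held to the same inequality against zero, so both must be exactly zero:

$$p = i$$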
[Figure: the (i, j) iteration space for the p = i partitioning. Each value of i forms its own partition, labeled p=0 through p=11 along the i axis.]
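So the partition expressed in the pragma:

#pragma paraguin forall C   p   i   j   \
                        0x0 -1  0x0 1   \
                        0x0 1   0x0 -1

represents the following matrix/vector product:

$$\begin{bmatrix} 0 & -1 & 0 & 1 \\ 0 & 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} 1 \\ p \\ i \\ j \end{bmatrix}$$

If we multiply this out, we get one expression per row:

$$-p + j \qquad \text{and} \qquad p - j$$

Now simplify: again the two expressions are negations of each other and are held to the same inequality against zero, so both must be exactly zero:

$$p = j$$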
[Figure: the (i, j) iteration space for the p = j partitioning. Each value of j forms its own partition, labeled p=0 through p=11 along the j axis.]
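Let’s say we want to partition using p = i + j. We actually have to go the other direction: start from the equation and work back to the rows of the matrix.

Moving everything to one side gives the expression $-p + i + j$ and its negation $p - i - j$. Over the columns (C, p, i, j), these are the coefficient rows

$$\begin{bmatrix} 0 & -1 & 1 & 1 \\ 0 & 1 & -1 & -1 \end{bmatrix}$$

To write this as a pragma:

#pragma paraguin forall C   p   i   j   \
                        0x0 -1  1   1   \
                        0x0 1   -1  -1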
[Figure: the (i, j) iteration space for the p = i + j partitioning. Each partition is a diagonal of the iteration space; the diagonals are labeled p=1 through p=23.]
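The three partitionings reviewed above (p = i, p = j, and p = i + j) can be checked numerically. The sketch below is plain C, not Paraguin code. It assumes that each pragma row is a dot product with the vector (1, p, i, j) and that a row paired with its negation pins that dot product to zero. The names `row_value`, `partition_of`, `BY_I`, `BY_J`, and `BY_SUM` are hypothetical.

```c
/* Dot product of one pragma row (columns C, p, i, j) with (1, p, i, j). */
static int row_value(const int row[4], int p, int i, int j)
{
    return row[0] * 1 + row[1] * p + row[2] * i + row[3] * j;
}

/* Returns the partition p in [0, pmax] for which both rows evaluate to
 * zero, or -1 if there is none. Assumes rows[1] is the negation of
 * rows[0], as in the pragmas above. */
static int partition_of(const int rows[2][4], int i, int j, int pmax)
{
    int p;

    for (p = 0; p <= pmax; p++) {
        if (row_value(rows[0], p, i, j) == 0 &&
            row_value(rows[1], p, i, j) == 0)
            return p;
    }
    return -1;
}

/* The coefficient rows of the three pragmas, with 0x0 written as 0. */
static const int BY_I[2][4]   = { {0, -1, 1, 0}, {0, 1, -1,  0} };  /* p = i     */
static const int BY_J[2][4]   = { {0, -1, 0, 1}, {0, 1,  0, -1} };  /* p = j     */
static const int BY_SUM[2][4] = { {0, -1, 1, 1}, {0, 1, -1, -1} };  /* p = i + j */
```

Calling `partition_of(BY_SUM, 3, 7, 30)` returns 10, since iteration (3, 7) lies on the diagonal i + j = 10, while `BY_I` and `BY_J` place the same iteration in partitions 3 and 7.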
Traveling Salesman Problem (TSP)
The Traveling Salesman Problem is simply to find the shortest circuit (Hamiltonian circuit) that visits every city in a set of cities exactly once and returns to the starting city.
This problem falls into the class of “NP-hard” problems. What that means is that there is no known “polynomial” time (“big-oh” of a polynomial) algorithm that can solve it.
The straightforward way to solve it exactly is to compare the distances of all possible Hamiltonian circuits. But there are N! possible circuits of N cities.
Yes, heuristics can be applied to find a “good” solution fast, but there’s no guarantee it is the best.
The “brute force” algorithm is to consider all possible permutations of the N cities.
First we’ll fix the first city, since each circuit has N equivalent rotations.
We will treat the two directions of a circuit as different circuits, because that duplication is hard to account for.
If we number the cities from 0 to N-1, and 0 is the origination city, then the possible permutations of 4 cities are:
0->1->2->3->0
0->1->3->2->0
0->2->3->1->0
0->2->1->3->0
0->3->1->2->0
0->3->2->1->0
Notice that some permutations are the reverse of others. These are equivalent permutations.
Since we are fixing the origination city, there are (N-1)! permutations.
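The (N-1)! count can be confirmed by enumerating the circuits directly. The following is a minimal recursive sketch, not the parallel program from these slides; `enumerate`, `count_circuits`, and `MAX_CITIES` are hypothetical names, and the order in which circuits are produced may differ from the list above.

```c
#include <stdio.h>

#define MAX_CITIES 10   /* hypothetical upper bound for this sketch */

/* Recursively fills perm[pos..n-1] with the unused cities and returns the
 * number of complete circuits found. If print is nonzero, each circuit is
 * printed in the 0->...->0 form. */
static long enumerate(int perm[], int used[], int pos, int n, int print)
{
    long count = 0;
    int city, k;

    if (pos == n) {
        if (print) {
            for (k = 0; k < n; k++)
                printf("%d->", perm[k]);
            printf("%d\n", perm[0]);
        }
        return 1;
    }
    for (city = 1; city < n; city++) {
        if (!used[city]) {
            used[city] = 1;
            perm[pos] = city;
            count += enumerate(perm, used, pos + 1, n, print);
            used[city] = 0;
        }
    }
    return count;
}

/* Counts the circuits over n cities (2 <= n <= MAX_CITIES) that start
 * and end at city 0. */
static long count_circuits(int n, int print)
{
    int perm[MAX_CITIES] = {0};
    int used[MAX_CITIES] = {0};

    used[0] = 1;    /* city 0 is fixed as the origination city */
    return enumerate(perm, used, 1, n, print);
}
```

Calling `count_circuits(4, 1)` prints the six circuits over four cities and returns 6, which is (4-1)!.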
We can compute the distances between all pairs of locations (O(N^2)).
This is the input: one distance for each pair among City 0, City 1, City 2, and City 3 (six values):
77.301157 66.648884 10.524875 71.335061 79.977022 59.265103
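Given the pairwise distances, the length of one circuit is the sum of the distances along consecutive cities plus the leg back to the origin. The sketch below assumes the distances are held in a full n-by-n matrix stored row by row; the function name `circuit_length` and the array `EXAMPLE_D` are hypothetical, and the example distances are illustrative values, not the ones listed above.

```c
/* Length of the circuit perm[0] -> perm[1] -> ... -> perm[n-1] -> perm[0].
 * D is an n-by-n distance matrix stored row by row (assumed layout). */
static double circuit_length(const double *D, int n, const int perm[])
{
    double total = 0.0;
    int k;

    for (k = 0; k < n; k++) {
        int from = perm[k];
        int to = perm[(k + 1) % n];   /* wraps back to the origin */
        total += D[from * n + to];
    }
    return total;
}

/* Illustrative symmetric distances for 4 cities (made up for this sketch). */
static const double EXAMPLE_D[16] = {
     0.0,  2.0,  9.0, 10.0,
     2.0,  0.0,  6.0,  4.0,
     9.0,  6.0,  0.0,  3.0,
    10.0,  4.0,  3.0,  0.0
};
```

With these values, the circuit 0->1->2->3->0 has length 2 + 6 + 3 + 10 = 21, and 0->1->3->2->0 has length 2 + 4 + 3 + 9 = 18.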
Problem: Iterating through the possible permutations is recursive, but we need a straightforward for loop to parallelize.
Solution: Use a for loop to assign the first two cities after the origin.
Since city 0 is fixed, there are n-1 choices for city 1 and n-2 choices for city 2.
That means there are (n-1)(n-2) = n^2 - 3n + 2 combinations of the first two cities.
Assignment of cities 0-2

N = n*n - 3*n + 2;   // (n-1)(n-2)
perm[0] = 0;
for (i = 0; i < N; i++) {
    perm[1] = i / (n-2) + 1;
    perm[2] = i % (n-2) + 1;
    ...
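The slide elides the rest of the loop body. As written, `i % (n-2) + 1` ranges over 1 to n-2 and can coincide with `perm[1]`, so some adjustment is presumably made in the elided part. The sketch below shows one possible adjustment: it is an assumption, not the author's code, and the names `first_cities` and `all_pairs_distinct` are hypothetical.

```c
/* Decodes loop index i (0 <= i < (n-1)*(n-2)) into the second and third
 * cities of a circuit whose first city is 0. Requires n >= 3. */
static void first_cities(int i, int n, int perm[])
{
    perm[0] = 0;
    perm[1] = i / (n - 2) + 1;      /* 1 .. n-1 */
    perm[2] = i % (n - 2) + 1;      /* 1 .. n-2, may collide with perm[1] */
    if (perm[2] >= perm[1])         /* assumed fix-up: skip over perm[1] */
        perm[2]++;
}

/* Returns 1 if every index yields a distinct, valid (perm[1], perm[2])
 * pair for n cities, 0 otherwise. Supports 3 <= n <= 32. */
static int all_pairs_distinct(int n)
{
    int total = (n - 1) * (n - 2);
    int seen[32][32] = {{0}};
    int perm[3];
    int i;

    for (i = 0; i < total; i++) {
        first_cities(i, n, perm);
        if (perm[1] == perm[2] || perm[2] < 1 || perm[2] > n - 1)
            return 0;
        if (seen[perm[1]][perm[2]])
            return 0;
        seen[perm[1]][perm[2]] = 1;
    }
    return 1;
}
```

With n = 4, index 0 decodes to the prefix 0, 1, 2 and index 5 decodes to 0, 3, 2; each of the six indices gives a different pair of distinct cities.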
;
#pragma paraguin begin_parallel

perm[0] = 0;
minDist = -1.0;
if (n == 2) {
    perm[1] = 1;    // If n == 2, then N == 0,
                    // and we are done.
    minPerm[0] = perm[0];
    minPerm[1] = perm[1];
    minDist = computeDist(D, n, perm);
}

#pragma paraguin bcast n
#pragma paraguin bcast N
#pragma paraguin bcast D
#pragma paraguin forall C   p   N   i   \
                        0x0 -1  0x0 1   \
                        0x0 1   0x0 -1

for (i = 0; i < N; i++) {
    perm[1] = i / (n-2) + 1;
    perm[2] = i % (n-2) + 1;
    ...
Sobel Edge Detection
Sobel Edge Detection Given an image, the problem is to detect where the “edges” are in the picture
Sobel Edge Detection Algorithm

/* 3x3 Sobel masks. */
GX[0][0] = -1; GX[0][1] = 0; GX[0][2] = 1;
GX[1][0] = -2; GX[1][1] = 0; GX[1][2] = 2;
GX[2][0] = -1; GX[2][1] = 0; GX[2][2] = 1;

GY[0][0] =  1; GY[0][1] =  2; GY[0][2] =  1;
GY[1][0] =  0; GY[1][1] =  0; GY[1][2] =  0;
GY[2][0] = -1; GY[2][1] = -2; GY[2][2] = -1;

// Pragmas go here (in front of the x loop)
for(x=0; x < N; ++x){
    for(y=0; y < N; ++y){
        sumx = 0;
        sumy = 0;
        // handle image boundaries
        if(x==0 || x==(h-1) || y==0 || y==(w-1))
            sum = 0;
        else{
            for(i=-1; i<=1; i++) {
                for(j=-1; j<=1; j++) {
                    //x gradient approx
                    sumx += (grayImage[x+i][y+j] * GX[i+1][j+1]);
                    //y gradient approx
                    sumy += (grayImage[x+i][y+j] * GY[i+1][j+1]);
                }
            }
            //gradient magnitude approx
            sum = (abs(sumx) + abs(sumy));
        }
        edgeImage[x][y] = clamp(sum);
    }
}
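For reference, the algorithm above can be written as a self-contained sequential function and tried on a tiny image. This sketch makes several assumptions that are not in the slides: the image is stored row by row in a flat array of 8-bit pixels, the loops run over h rows and w columns, and `clamp` limits the result to 0..255 (the slides call `clamp` without showing its body). The function name `sobel` is hypothetical.

```c
#include <stdlib.h>   /* abs */

/* Limits a gradient magnitude to the 0..255 range of an 8-bit pixel.
 * Assumed behavior; the slides do not show clamp(). */
static unsigned char clamp(int v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (unsigned char)v;
}

/* Sequential Sobel pass over an h-by-w grayscale image stored row by row. */
static void sobel(int h, int w, const unsigned char *gray, unsigned char *edge)
{
    static const int GX[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
    static const int GY[3][3] = { { 1, 2, 1}, { 0, 0, 0}, {-1, -2, -1} };
    int x, y, i, j;

    for (x = 0; x < h; x++) {
        for (y = 0; y < w; y++) {
            int sumx = 0, sumy = 0, sum;

            if (x == 0 || x == h - 1 || y == 0 || y == w - 1) {
                sum = 0;                        /* image boundary */
            } else {
                for (i = -1; i <= 1; i++) {
                    for (j = -1; j <= 1; j++) {
                        int pixel = gray[(x + i) * w + (y + j)];
                        sumx += pixel * GX[i + 1][j + 1];
                        sumy += pixel * GY[i + 1][j + 1];
                    }
                }
                sum = abs(sumx) + abs(sumy);    /* gradient magnitude approx */
            }
            edge[x * w + y] = clamp(sum);
        }
    }
}
```

On a uniform image every interior pixel comes out as 0, while a 4x4 image whose left half is 0 and right half is 255 produces the maximum value 255 at the interior pixels next to the step.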
Inputs (that need to be broadcast or scattered):
- GX and GY arrays
- grayImage array
- w and h (width and height)
There are 4 nested loops (x, y, i, and j).
The final answer is the array edgeImage.
We put these in front of the x loop to parallelize it.

;
#pragma paraguin begin_parallel

// These are the inputs
#pragma paraguin bcast grayImage
#pragma paraguin bcast w
#pragma paraguin bcast h

// Partition the x loop (outermost loop)
#pragma paraguin forall C   p   x   y   i   j   \
                        0x0 -1  1   0x0 0x0 0x0 \
                        0x0 1   -1  0x0 0x0 0x0

// Gather all elements of the edgeImage array
#pragma paraguin gather 4 C   x   y   \
                          0x0 0x0 0x0