Paraguin Compiler Examples.


Examples
- Matrix Addition (the complete program)
- Traveling Salesman Problem (TSP)
- Sobel Edge Detection

Matrix Addition The complete program

Matrix Addition (complete)

    #define N 512

    #ifdef PARAGUIN
    typedef void* __builtin_va_list;
    #endif

    #include <stdio.h>
    #include <math.h>
    #include <sys/time.h>

    void print_results(char *prompt, float a[N][N]);

    int main(int argc, char *argv[])
    {
        int i, j, error = 0;
        float a[N][N], b[N][N], c[N][N];
        char *usage = "Usage: %s file\n";
        FILE *fd;
        double elapsed_time;
        struct timeval tv1, tv2;

Matrix Addition (complete)

        if (argc < 2) {
            fprintf (stderr, usage, argv[0]);
            error = -1;
        }
        if ((fd = fopen (argv[1], "r")) == NULL) {
            fprintf (stderr, "%s: Cannot open file %s for reading.\n",
                     argv[0], argv[1]);
            error = -1;
        }

    #pragma paraguin begin_parallel
    #pragma paraguin bcast error
        if (error) return -1;
    #pragma paraguin end_parallel

Matrix Addition (complete)

        // Read input from file for matrices a and b.
        // The I/O is not timed because this I/O needs
        // to be done regardless of whether this program
        // is run sequentially on one processor or in
        // parallel on many processors. Therefore, it is
        // irrelevant when considering speedup.
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                fscanf (fd, "%f", &a[i][j]);
                fscanf (fd, "%f", &b[i][j]);
            }
        fclose (fd);

Matrix Addition (complete)

        ;
    #pragma paraguin begin_parallel
        // This barrier is here so that we can take a time stamp
        // once we know all processes are ready to go.
    #pragma paraguin barrier
    #pragma paraguin end_parallel

        // Take a time stamp
        gettimeofday(&tv1, NULL);

    #pragma paraguin begin_parallel
    #pragma paraguin scatter a b

        // Parallelize the following loop nest assigning iterations
        // of the outermost loop (i) to different partitions.
    #pragma paraguin forall
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                c[i][j] = a[i][j] + b[i][j];

Matrix Addition (complete)

        ;
    #pragma paraguin gather c
    #pragma paraguin end_parallel

        // Take a time stamp. This won't happen until after the master
        // process has gathered all the input from the other processes.
        gettimeofday(&tv2, NULL);
        elapsed_time = (tv2.tv_sec - tv1.tv_sec) +
                       ((tv2.tv_usec - tv1.tv_usec) / 1000000.0);
        printf ("elapsed_time=\t%lf (seconds)\n", elapsed_time);

        // print result
        print_results("C = ", c);
    }

Matrix Addition (complete)

    void print_results(char *prompt, float a[N][N])
    {
        int i, j;

        printf ("\n\n%s\n", prompt);
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                printf(" %.2f", a[i][j]);
            }
            printf ("\n");
        }
        printf ("\n\n");
    }

Matrix Addition

After compiling with the command (all on one line):

    scc -DPARAGUIN -D__x86_64__ matrixadd.c -cc mpicc -o matrixadd.out

This produces:

    matrixadd.out      (the executable)
    matrixadd.out.c    (the MPI source code)

Traveling Salesman Problem (TSP)

The Traveling Salesman Problem is simply to find the shortest circuit (a Hamiltonian circuit) that visits every city in a set of cities exactly once.

This problem falls into the class of "NP-hard" problems. What that means is that there is no known "polynomial" time ("big-oh" of a polynomial) algorithm that can solve it. The only known algorithm guaranteed to solve it is to compare the distances of all possible Hamiltonian circuits. But there are N! possible circuits of N cities.

Yes, heuristics can be applied to find a "good" solution fast, but there's no guarantee it is the best. The "brute force" algorithm is to consider all possible permutations of the N cities. First we'll fix the first city, since there are N equivalent circuits in which the cities are simply rotated. We will consider reversed directions to be distinct circuits, even though they are equivalent, because eliminating them is hard to account for.

If we number the cities from 0 to N-1, and 0 is the origination city, then the possible permutations of 4 cities are:

    0->1->2->3->0
    0->1->3->2->0
    0->2->3->1->0
    0->2->1->3->0
    0->3->1->2->0
    0->3->2->1->0

Notice that some permutations are the reverse of others. These are equivalent circuits. Since we are fixing the origination city, there are (N-1)! permutations instead of N!.

We can compute the distances between all pairs of locations (O(N^2)). This is the input:

               City 0      City 1      City 2
    City 1   77.301157
    City 2   66.648884   10.524875
    City 3   71.335061   79.977022   59.265103

Problem: Iterating through the possible permutations is recursive, but we need a straightforward for loop to parallelize.
Solution: Use a for loop to assign the first two cities.
Since city 0 is fixed, there are n-1 choices for city 1 and n-2 choices for city 2. That means there are (n-1)(n-2) = n^2 - 3n + 2 combinations of the first two cities.

Assignment of cities 0-2:

    N = n*n - 3*n + 2;   // (n-1)(n-2)
    perm[0] = 0;
    for (i = 0; i < N; i++) {
        perm[1] = i / (n-2) + 1;
        perm[2] = i % (n-2) + 1;
        ...

    // This structure is used for the "minloc" reduction. We want to find the
    // minimum distance as well as which processor found it so that we can
    // get the final minimum circuit.
    struct {
        float minDist;
        int rank;
    } myAnswer, resultAnswer;

    int main(int argc, char *argv[])
    {
        int i, j, k, N, p;
        int perm[MAX_NUM_CITIES], minPerm[MAX_NUM_CITIES+1];
        float D[MAX_NUM_CITIES][MAX_NUM_CITIES];
        float dist, minDist, finalMinDist;
        int abort;

To use minloc, we need a value to minimize and the location (rank).

        abort = processArgs(argc, argv);
        if (!abort) {
            for (i = 0; i < n; i++) {
                D[i][i] = 0.0f;
                for (j = 0; j < i; j++) {
                    fscanf (fd, "%f", &D[i][j]);
                    D[j][i] = D[i][j];
                }
            }
        } else {
            if (n <= 1) printf ("0 0 0\n");
            else n = 0;
        }

    #pragma paraguin begin_parallel
    #pragma paraguin bcast abort
        if (abort) return -1;

Read in the lower triangular matrix of relative distances between cities. The values are mirrored across the diagonal.

    #pragma paraguin bcast n D

        perm[0] = 0;
        minDist = 9.0e10;   // Near the largest value we can represent with a float

        if (n == 2) {
            perm[1] = 1;    // If n == 2, then N = 0, and we are done.
            minPerm[0] = perm[0];
            minPerm[1] = perm[1];
            minDist = computeDist(D, n, perm);
        }

        N = n*n - 3*n + 2;  // N = (n-1)(n-2)

    #pragma paraguin forall
        for (p = 0; p < N; p++) {
            perm[1] = p / (n-2) + 1;
            perm[2] = p % (n-2) + 1;
            if (perm[2] >= perm[1]) perm[2]++;

            // After cities 0, 1, and 2 have been determined, initialize
            // the rest of the cities for the first permutation.
            initialize(perm, n, 3);

            do {
                dist = computeDist(D, n, perm);
                // Keep the shortest circuit seen thus far.
                if (minDist > dist) {
                    minDist = dist;
                    for (i = 0; i < n; i++) minPerm[i] = perm[i];
                }
            } while (increment(perm, n));   // Move to the next permutation.
        }

        myAnswer.minDist = minDist;
        myAnswer.rank = __guin_rank;

    #pragma paraguin reduce minloc myAnswer resultAnswer
    #pragma paraguin bcast resultAnswer

        if (__guin_rank == resultAnswer.rank) {
            printf (" %f ", minDist);
            for (i = 0; i < n; i++)
                printf ("%d ", minPerm[i]);
            printf ("%d\n", minPerm[0]);
        }
    #pragma paraguin end_parallel

If the current processor is the one that found the solution, then it reports the solution.

Demonstration

Sobel Edge Detection

Sobel Edge Detection Given an image, the problem is to detect where the “edges” are in the picture

Sobel Edge Detection

Sobel Edge Detection Algorithm

    /* 3x3 Sobel masks. */
    GX[0][0] = -1; GX[0][1] = 0; GX[0][2] = 1;
    GX[1][0] = -2; GX[1][1] = 0; GX[1][2] = 2;
    GX[2][0] = -1; GX[2][1] = 0; GX[2][2] = 1;

    GY[0][0] =  1; GY[0][1] =  2; GY[0][2] =  1;
    GY[1][0] =  0; GY[1][1] =  0; GY[1][2] =  0;
    GY[2][0] = -1; GY[2][1] = -2; GY[2][2] = -1;

    for (x = 0; x < h; ++x) {
        for (y = 0; y < w; ++y) {
            sumx = 0;
            sumy = 0;
            // handle image boundaries
            if (x == 0 || x == (h-1) || y == 0 || y == (w-1))
                sum = 0;
            else {

Sobel Edge Detection Algorithm

                // x gradient approx
                for (i = -1; i <= 1; i++) {
                    for (j = -1; j <= 1; j++) {
                        sumx += (grayImage[x+i][y+j] * GX[i+1][j+1]);
                        // y gradient approx
                        sumy += (grayImage[x+i][y+j] * GY[i+1][j+1]);
                    }
                }
                // gradient magnitude approx
                sum = (abs(sumx) + abs(sumy));
            }
            edgeImage[x][y] = clamp(sum);

There are no loop-carried dependencies. Therefore, this is a Scatter/Gather pattern.

Loop-Carried Dependency

A loop-carried dependency is when a value is computed in one iteration and used in another:

    for (i = 1; i < n; i++) {
        A[i] = f(A[i-1]);
    }

    <1>: A[1] = f(A[0]);
    <2>: A[2] = f(A[1]);
    <3>: A[3] = f(A[2]);
    <4>: A[4] = f(A[3]);
    ...

If we run this loop as a forall, then inter-processor communication is needed.

Sobel Edge Detection Algorithm

Inputs (that need to be broadcast or scattered):
- GX and GY arrays
- grayImage array
- w and h (width and height)

There are 4 nested loops (x, y, i, and j). The final answer is the array edgeImage.

Sobel Edge Detection Algorithm

    #pragma paraguin begin_parallel

        /* 3x3 Sobel masks. */
        GX[0][0] = -1; GX[0][1] = 0; GX[0][2] = 1;
        GX[1][0] = -2; GX[1][1] = 0; GX[1][2] = 2;
        GX[2][0] = -1; GX[2][1] = 0; GX[2][2] = 1;

        GY[0][0] =  1; GY[0][1] =  2; GY[0][2] =  1;
        GY[1][0] =  0; GY[1][1] =  0; GY[1][2] =  0;
        GY[2][0] = -1; GY[2][1] = -2; GY[2][2] = -1;

        // These are the inputs
    #pragma paraguin bcast grayImage w h

        // Partition the x loop (outermost loop) using cyclic scheduling
    #pragma paraguin forall 1
        for (x = 0; x < h; ++x) {
            for (y = 0; y < w; ++y) {
                sumx = 0;
                sumy = 0;
                ...

Sobel Edge Detection Algorithm

                ...
                edgeImage[x][y] = clamp(sum);
            }
        }
        ;
        // Gather all elements of the edgeImage array
    #pragma paraguin gather edgeImage

Questions