CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
APPLIED PARALLEL ALGORITHMS 1
Prof. Thomas Sterling
Dr. Hartmut Kaiser
Department of Computer Science, Louisiana State University
March 10th, 2011

Dr. Hartmut Kaiser
Center for Computation & Technology
R315 Johnston

Puzzle of the Day
What's the difference between the following valid C function declarations?
    void foo();
    void foo(void);
    void foo(...);

Puzzle of the Day (answer)
What's the difference between the following valid C function declarations?
    void foo();        unspecified (any) number of parameters
    void foo(void);    no parameters
    void foo(...);     any number of parameters
What's the difference between the following valid C++ function declarations?
    void foo();        no parameters
    void foo(void);    no parameters
    void foo(...);     any number of parameters
Note: a declaration of the form void foo(...) with no named parameter is accepted by C++ and by C23; earlier C standards require at least one named parameter before the ellipsis. C23 also gives void foo(); the C++ meaning (no parameters).

Topics
  Introduction
  Mandelbrot Sets
  Monte Carlo: PI Calculation
  Vector Dot-Product
  Matrix Multiplication

Parallel Programming
Goals
  – Correctness
  – Reduction in execution time
  – Efficiency
  – Scalability
  – Increased problem size and richness of models
Objectives
  – Expose parallelism
      Algorithm design
  – Distribute work uniformly
      Data decomposition and allocation
      Dynamic load balancing
  – Minimize overhead of synchronization and communication
      Coarse granularity
      Big messages
  – Minimize redundant work
      Still sometimes better than communication

Basic Parallel (MPI) Program Steps
  Establish logical bindings
  Initialize application execution environment
  Distribute data and work
  Perform core computations in parallel (across nodes)
  Synchronize and exchange intermediate data results
    – Optional for non-embarrassingly parallel (cooperative)
  Detect "stop" condition
    – May be implicit with a barrier etc.
  Aggregate final results
    – Often a reduction operator
  Output results and error code
  Terminate and return to OS

"embarrassingly parallel"
Common phrase
  – Poorly defined
  – Widely used
Suggests lots and lots of parallelism
  – With essentially no inter-task communication or coordination
  – Highly partitionable workload with minimal overhead
"almost embarrassingly parallel"
  – Same as above, but
  – Requires master to launch many tasks
  – Requires master to collect final results of tasks
  – Sometimes still referred to as "embarrassingly parallel"

Mandelbrot set
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, Pearson Education Inc. All rights reserved.

Mandelbrot Set
Set of points in a complex plane that are quasi-stable (will increase and decrease, but not exceed some limit) when computed by iterating the function
    $z_{k+1} = z_k^2 + c$
where $z_{k+1}$ is the $(k+1)$th iteration of the complex number $z = a + bi$ and $c$ is a complex number giving the position of the point in the complex plane. The initial value for $z$ is zero.
Iterations continue until the magnitude of $z$ is greater than 2 or the number of iterations reaches an arbitrary limit. The magnitude of $z$ is the length of the vector, given by
    $z_{\mathrm{length}} = \sqrt{a^2 + b^2}$

Sequential routine computing the value of one point, returning the number of iterations

    typedef struct {
        float real;
        float imag;
    } complex;

    int cal_pixel(complex c)
    {
        int count, max;
        complex z;
        float temp, lengthsq;

        max = 256;
        z.real = 0;
        z.imag = 0;
        count = 0;   /* number of iterations */
        do {
            /* z = z*z + c, written out in real and imaginary parts */
            temp = z.real * z.real - z.imag * z.imag + c.real;
            z.imag = 2 * z.real * z.imag + c.imag;
            z.real = temp;
            lengthsq = z.real * z.real + z.imag * z.imag;
            count++;
        } while ((lengthsq < 4.0) && (count < max));
        return count;
    }

Parallelizing Mandelbrot Set Computation
Static Task Assignment
  Simply divide the region into a fixed number of parts, each computed by a separate processor.
  Not very successful because different regions require different numbers of iterations and time.
Dynamic Task Assignment
  Have processors request new regions after computing previous regions.

Dynamic Task Assignment
Work Pool/Processor Farms

Flowchart for Mandelbrot Set Generation
"master" and "workers":
  Initialize MPI Environment
  Create Local Workload buffer
  Isolate work regions
  Calculate Mandelbrot set values across work region
"master" only:
  Write result from task 0 to file
  Recv. results from "workers"
  Concatenate results to file
"workers" only:
  Send result to "master"
End

Mandelbrot Sets (source code)

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    typedef struct complex {
        double real;
        double imag;
    } Complex;

    int cal_pixel(Complex c)
    {
        int count, max_iter;
        Complex z;
        double temp, lengthsq;

        max_iter = 256;
        z.real = 0;
        z.imag = 0;
        count = 0;
        do {
            temp = z.real * z.real - z.imag * z.imag + c.real;
            z.imag = 2 * z.real * z.imag + c.imag;
            z.real = temp;
            lengthsq = z.real * z.real + z.imag * z.imag;
            count++;
        } while ((lengthsq < 4.0) && (count < max_iter));
        return count;
    }

cal_pixel() runs on every worker process and calculates the iteration count for every pixel.

Mandelbrot Sets (source code)

    #define MASTERPE 0

    int main(int argc, char **argv)
    {
        FILE *file;
        int i, j;
        int tmp;
        Complex c;
        double *data_l, *data_l_tmp;
        int nx, ny;
        int mystrt, myend;
        int nrows_l;
        int nprocs, mype;
        MPI_Status status;

        /***** Initializing MPI Environment *****/
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &mype);

        /***** Pass in the dimension (X,Y) of the area to cover *****/
        if (argc != 3) {
            int err = 1;   /* nonzero error code */
            printf("argc %d\n", argc);
            if (mype == MASTERPE)
                printf("usage: mandelbrot nx ny\n");
            /* every rank aborts, so no rank goes on to read missing arguments */
            MPI_Abort(MPI_COMM_WORLD, err);
        }

        /* get command line args */
        nx = atoi(argv[1]);
        ny = atoi(argv[2]);

Initialize MPI Environment.
Check whether the input arguments (x,y dimensions of the region to be processed) are passed.

Mandelbrot Sets (source code)

        /* assume divides equally */
        nrows_l = nx/nprocs;
        mystrt = mype*nrows_l;
        myend = mystrt + nrows_l - 1;

        /* create buffer for local work only */
        data_l = (double *) malloc(nrows_l * ny * sizeof(double));
        data_l_tmp = data_l;

        /* calc each procs coordinates and call local mandelbrot value generation function */
        for (i = mystrt; i <= myend; ++i) {
            c.real = i/((double) nx) * ;   /* scale and offset constants missing in the source */
            for (j = 0; j < ny; ++j) {
                c.imag = j/((double) ny) * ;   /* scale and offset constants missing in the source */
                tmp = cal_pixel(c);
                *data_l++ = (double) tmp;
            }
        }
        data_l = data_l_tmp;   /* rewind to the start of the local buffer */

Determine the dimensions of the work to be performed by each concurrent task.
Local tasks calculate the coordinates for each pixel in the local region.
For each pixel, the cal_pixel() function is called and the corresponding value is calculated.

Mandelbrot Sets (source code)

        if (mype == MASTERPE) {
            file = fopen("mandelbrot.bin_0000", "w");
            printf("nrows_l, ny %d %d\n", nrows_l, ny);
            fwrite(data_l, nrows_l*ny, sizeof(double), file);
            fclose(file);
            for (i = 1; i < nprocs; ++i) {
                MPI_Recv(data_l, nrows_l * ny, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
                printf("received message from proc %d\n", i);
                file = fopen("mandelbrot.bin_0000", "a");
                fwrite(data_l, nrows_l*ny, sizeof(double), file);
                fclose(file);
            }
        }
        else {
            MPI_Send(data_l, nrows_l * ny, MPI_DOUBLE, MASTERPE, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }

Master process opens a file to store output into and stores its own values in the file.
Master then waits to receive the values computed by each of the worker processes, in rank order.
Worker processes send the computed Mandelbrot values of their region to the master process.

Demo: Mandelbrot Sets

Monte Carlo Simulation
Used when it is infeasible or impossible to compute an exact result with a deterministic algorithm
Especially useful in
  – Studying systems with a large number of coupled degrees of freedom
      Fluids, disordered materials, strongly coupled solids, cellular structures
  – Modeling phenomena with significant uncertainty in inputs
      The calculation of risk in business
  – Mathematics, where these methods are also widely used
      The evaluation of definite integrals, particularly multidimensional integrals with complicated boundary conditions

Monte Carlo Simulation
No single approach, multitude of different methods
Usually follows this pattern
  – Define a domain of possible inputs
  – Generate inputs randomly from the domain
  – Perform a deterministic computation using the inputs
  – Aggregate the results of the individual computations into the final result
Example: calculate Pi

Monte Carlo: Algorithm for Pi
The value of PI can be calculated in a number of ways. Consider the following method of approximating PI:
  Inscribe a circle in a square
  Randomly generate points in the square
  Determine the number of points in the square that are also in the circle
  Let r be the number of points in the circle divided by the number of points in the square
  PI ~ 4 r
  Note that the more points generated, the better the approximation
Algorithm:
    npoints = <number of points to generate>
    circle_count = 0
    do j = 1,npoints
        generate 2 random numbers between 0 and 1
        xcoordinate = random1 ; ycoordinate = random2
        if (xcoordinate, ycoordinate) inside circle
            then circle_count = circle_count + 1
    end do
    PI = 4.0*circle_count/npoints

OpenMP Pi Calculation (flowchart)
Master thread:
  Initialize variables
  Initialize OpenMP parallel environment
Master thread and N worker threads, in parallel:
  Generate random X,Y
  Calculate Z = X^2 + Y^2
  If point lies within the circle (Y): Count++
Reduction ∑
Master thread:
  Calculate PI
  Print value of pi

OpenMP Calculating Pi

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <omp.h>

    #define SEED 42   /* seed for generating random numbers */

    int main(int argc, char *argv[])
    {
        int niter = 0;
        double x, y;
        int i, tid, count = 0;   /* # of points in the 1st quadrant of unit circle */
        double z;
        double pi;
        time_t rawtime;
        struct tm *timeinfo;

        printf("Enter the number of iterations used to estimate pi: ");
        scanf("%d", &niter);
        time(&rawtime);
        timeinfo = localtime(&rawtime);

OpenMP Calculating Pi

        printf("The current date/time is: %s", asctime(timeinfo));

        /* initialize random numbers */
        srand(SEED);

        /* note: rand() keeps hidden shared state and is not guaranteed to be thread-safe */
    #pragma omp parallel for private(x,y,z,tid) reduction(+:count)
        for (i = 0; i < niter; i++) {
            x = (double)rand()/RAND_MAX;
            y = (double)rand()/RAND_MAX;
            z = (x*x + y*y);
            if (z <= 1) count++;
            /* inside the loop, count is the calling thread's private partial count */
            if (i == (niter/6)-1) {
                tid = omp_get_thread_num();
                printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
            }
            if (i == (niter/3)-1) {
                tid = omp_get_thread_num();
                printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
            }
            if (i == (niter/2)-1) {
                tid = omp_get_thread_num();
                printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
            }

Initialize the random number generator; srand() seeds the random numbers generated by rand().
Initialize OpenMP parallel for with reduction (∑).
Randomly generate x,y points.
Calculate x^2+y^2 and check whether the point lies within the circle; if yes, increment count.

OpenMP Calculating Pi

            if (i == (2*niter/3)-1) {
                tid = omp_get_thread_num();
                printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
            }
            if (i == (5*niter/6)-1) {
                tid = omp_get_thread_num();
                printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
            }
            if (i == niter-1) {
                tid = omp_get_thread_num();
                printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
            }
        }   /* end of parallel for; the per-thread counts are summed here */

        time(&rawtime);
        timeinfo = localtime(&rawtime);
        printf("The current date/time is: %s", asctime(timeinfo));
        printf(" the total count is %i\n", count);
        pi = (double)count/niter*4;
        printf("# of trials= %d, estimate of pi is %g \n", niter, pi);
        return 0;
    }

Calculate PI based on the aggregate count of the points that lie within the circle.

Demo: OpenMP Pi
    l13]$ ./omcpi
    Enter the number of iterations used to estimate pi:
    The current date/time is: Tue Mar 4 05:53:
    thread 0 just did iteration the count is
    thread 1 just did iteration the count is 6514
    thread 1 just did iteration the count is
    thread 2 just did iteration the count is
    thread 3 just did iteration the count is 6445
    thread 3 just did iteration the count is
    The current date/time is: Tue Mar 4 05:53:
    the total count is
    # of trials= , estimate of pi is
    l13]$

Creating Custom Communicators
  Communicators define groups and the access patterns among them
  Default communicator is MPI_COMM_WORLD
  Some algorithms demand more sophisticated control of communications to take advantage of reduction operators
  MPI permits creation of custom communicators
    – MPI_Comm_create

MPI Monte Carlo Pi Computation (flowchart)
Server:
  Initialize MPI Environment
  Receive Request
  Compute Random Array
  Send Array to Requestor
  Last Request? (N: receive the next request; Y: Finalize MPI)
Worker:
  Initialize MPI Environment
  Receive Error Bound
  Send Request to Server
  Receive Random Array
  Perform Computations
  Propagate Number of Points (Allreduce)
  Stop Condition Satisfied? (N: send another request; Y: Finalize MPI)
Master:
  Initialize MPI Environment
  Broadcast Error Bound
  Send Request to Server
  Receive Random Array
  Perform Computations
  Propagate Number of Points (Allreduce)
  Output Partial Result
  Stop Condition Satisfied? (N: send another request; Y: Print Statistics, Finalize MPI)

Monte Carlo: MPI - Pi (source code)

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include "mpi.h"

    #define CHUNKSIZE 1000
    #define INT_MAX            /* value missing in the source */
    #define REQUEST 1
    #define REPLY 2

    int main(int argc, char *argv[])
    {
        int iter;
        int in, out, i, iters, max, ix, iy, ranks[1], done, temp;
        double x, y, Pi, error, epsilon;
        int numprocs, myid, server, totalin, totalout, workerid;
        int rands[CHUNKSIZE], request;
        MPI_Comm world, workers;
        MPI_Group world_group, worker_group;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        world = MPI_COMM_WORLD;
        MPI_Comm_size(world, &numprocs);
        MPI_Comm_rank(world, &myid);

Initialize MPI environment.

Monte Carlo: MPI - Pi (source code)

        server = numprocs-1;   /* last proc is server */
        if (myid == 0)
            sscanf(argv[1], "%lf", &epsilon);
        MPI_Bcast(&epsilon, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* workers = every process except the server */
        MPI_Comm_group(world, &world_group);
        ranks[0] = server;
        MPI_Group_excl(world_group, 1, ranks, &worker_group);
        MPI_Comm_create(world, worker_group, &workers);
        MPI_Group_free(&worker_group);

        if (myid == server) {
            do {
                MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, REQUEST, world, &status);
                if (request) {
                    for (i = 0; i < CHUNKSIZE; ) {
                        rands[i] = random();
                        if (rands[i] <= INT_MAX) i++;
                    }
                    /* Send random number array */
                    MPI_Send(rands, CHUNKSIZE, MPI_INT, status.MPI_SOURCE, REPLY, world);
                }
            } while (request > 0);
        }
        else {   /* Begin Worker Block */
            request = 1;
            done = in = out = 0;
            max = INT_MAX;   /* max int, for normalization */
            MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
            MPI_Comm_rank(workers, &workerid);
            iter = 0;

Broadcast error bound: epsilon.
Create a custom communicator.
Server process: 1. receives a request to generate random numbers, 2. computes the random number array, 3. sends the array to the requestor.
Worker process: requests the server to generate a random number array.

Monte Carlo: MPI - Pi (source code)

            while (!done) {
                iter++;
                request = 1;
                /* Recv. random array from server */
                MPI_Recv(rands, CHUNKSIZE, MPI_INT, server, REPLY, world, &status);
                for (i = 0; i < CHUNKSIZE-1; ) {
                    x = (((double) rands[i++])/max) * 2 - 1;
                    y = (((double) rands[i++])/max) * 2 - 1;
                    if (x*x + y*y < 1.0)
                        in++;
                    else
                        out++;
                }
                MPI_Allreduce(&in, &totalin, 1, MPI_INT, MPI_SUM, workers);
                MPI_Allreduce(&out, &totalout, 1, MPI_INT, MPI_SUM, workers);
                Pi = (4.0*totalin)/(totalin + totalout);
                error = fabs( Pi );   /* reference value to subtract from Pi is missing in the source */
                done = (error );      /* stop test against epsilon is missing in the source */
                request = (done) ? 0 : 1;
                if (myid == 0) {   /* If "Master" : Print current value of PI */
                    printf("\rpi = %23.20f", Pi);
                    MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
                }
                else {   /* If "Worker" : Request new array if not finished */
                    if (request)
                        MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
                }
            }
            MPI_Comm_free(&workers);
        }

Worker: receive the random number array from the server.
Worker: for each pair x,y in the random number array, calculate the coordinates and determine whether the point is inside or outside the circle.
Compute the value of pi and check whether the error is within the threshold.
Print the current value of PI and request more work.

Monte Carlo: MPI - Pi (source code)

        if (myid == 0) {   /* If "Master" : Print Results */
            printf("\npoints: %d\nin: %d, out: %d, to exit\n", totalin+totalout, totalin, totalout);
            getchar();
        }
        MPI_Finalize();
    }

Print the final value of PI.

Demo: MPI Monte Carlo, Pi
    > mpirun -np 4 monte 1e-20
    pi =
    points:
    in: , out:

Vector Dot Product
Multiplication of 2 vectors followed by summation
    A[i] = (X_1, X_2, X_3, X_4, X_5, …, X_n)
    B[i] = (Y_1, Y_2, Y_3, Y_4, Y_5, …, Y_n)
    A[i] * B[i] = (X_1 * Y_1, X_2 * Y_2, X_3 * Y_3, X_4 * Y_4, X_5 * Y_5, …, X_n * Y_n)
    $A \cdot B = \sum_{i=1}^{n} X_i Y_i$

OpenMP Dot Product: using Reduction (flowchart)
Master thread:
  Initialize variables
  Initialize OpenMP parallel environment
Master thread and N worker threads, in parallel:
  Calculate local computations
REDUCTION: ∑
Master thread:
  Print value of Dot Product
Workload and schedule are determined by OpenMP during runtime.

OpenMP Dot Product

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int i, n, chunk;
        float a[16], b[16], result;

        n = 16;
        chunk = 4;
        result = 0.0;
        for (i = 0; i < n; i++) {
            a[i] = i * 1.0;
            b[i] = i * 2.0;
        }

    #pragma omp parallel for default(shared) private(i) \
            schedule(static,chunk) reduction(+:result)
        for (i = 0; i < n; i++)
            result = result + (a[i] * b[i]);

        printf("Final result= %f\n", result);
        return 0;
    }

Reduction example with summation, where the result of the reduction operation stores the dot product of two vectors, ∑ a[i]*b[i].

Demo: Dot Product using Reduction
    l12]$ ./reduction
    a[i] b[i] a[i]*b[i]
    Final result=
    l12]$

MPI Dot Product Computation (flowchart)
Master:
  Initialize Variables
  Initialize MPI Environment
  Broadcast Size of Vectors
  Get Vector A & Distribute Partitioned Vector A
  Get Vector B & Distribute Partitioned Vector B
  Calculate dot-product for local workloads
  REDUCTION ∑
  Print Result
Worker:
  Initialize Variables
  Initialize MPI environment
  Receive Size of vectors
  Receive local workload for Vector A
  Receive local workload for Vector B
  Calculate dot-product for local workloads
  REDUCTION ∑

MPI Dot Product

    #include <stdio.h>
    #include "mpi.h"

    #define MAX_LOCAL_ORDER 100

    int main(int argc, char *argv[])
    {
        float local_x[MAX_LOCAL_ORDER];
        float local_y[MAX_LOCAL_ORDER];
        int n;
        int n_bar;   /* = n/p */
        float dot;
        int p;
        int my_rank;
        void Read_vector(char *prompt, float local_v[], int n_bar, int p, int my_rank);
        float Parallel_dot(float local_x[], float local_y[], int n_bar);

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        if (my_rank == 0) {
            printf("Enter the order of the vectors\n");
            scanf("%d", &n);
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

Initialize MPI Environment.
Broadcast the order of the vectors across the workers.
Source: Parallel Programming with MPI by Peter Pacheco

MPI Dot Product

        n_bar = n/p;
        Read_vector("the first vector", local_x, n_bar, p, my_rank);
        Read_vector("the second vector", local_y, n_bar, p, my_rank);
        dot = Parallel_dot(local_x, local_y, n_bar);
        if (my_rank == 0)
            printf("The dot product is %f\n", dot);
        MPI_Finalize();
    }   /* main */

    void Read_vector(
        char *prompt     /* in  */,
        float local_v[]  /* out */,
        int n_bar        /* in  */,
        int p            /* in  */,
        int my_rank      /* in  */)
    {
        int i, q;

Receive and distribute the two vectors.
Calculate the parallel dot product for local workloads.
Master: print the result of the dot product.

MPI Dot Product

        float temp[MAX_LOCAL_ORDER];
        MPI_Status status;

        if (my_rank == 0) {
            printf("Enter %s\n", prompt);
            /* the first n_bar values stay on the master */
            for (i = 0; i < n_bar; i++)
                scanf("%f", &local_v[i]);
            /* each following block of n_bar values goes to worker q */
            for (q = 1; q < p; q++) {
                for (i = 0; i < n_bar; i++)
                    scanf("%f", &temp[i]);
                MPI_Send(temp, n_bar, MPI_FLOAT, q, 0, MPI_COMM_WORLD);
            }
        } else {
            MPI_Recv(local_v, n_bar, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
        }
    }   /* Read_vector */

    float Serial_dot(
        float x[]  /* in */,

MASTER: get the input from the user and prepare the local workload.
MASTER: get the input from the user, store each work chunk in an array, and send the array to the worker nodes for processing.
Worker: receive the local workload to be processed.
Serial_dot(): calculates the dot product on local arrays.

MPI Dot Product

        float y[]  /* in */,
        int n      /* in */)
    {
        int i;
        float sum = 0.0;

        for (i = 0; i < n; i++)
            sum = sum + x[i]*y[i];
        return sum;
    }   /* Serial_dot */

    float Parallel_dot(
        float local_x[]  /* in */,
        float local_y[]  /* in */,
        int n_bar        /* in */)
    {
        float local_dot;
        float dot = 0.0;

        local_dot = Serial_dot(local_x, local_y, n_bar);
        MPI_Reduce(&local_dot, &dot, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
        return dot;
    }   /* Parallel_dot */

Serial_dot(): calculates the dot product on local arrays.
Parallel_dot(): calls Serial_dot() to compute the dot product of the local workload, then sums the local results with the collective MPI_Reduce call (MPI_SUM). The sum is delivered to rank 0 only.

Demo: MPI Dot Product
    l13]$ mpirun …../mpi_dot
    Enter the order of the vectors
    16
    Enter the first vector
    Enter the second vector
    The dot product is
    l13]$

Matrix Vector Multiplication
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, Pearson Education Inc. All rights reserved.

Matrix-Vector Multiplication
    c = A x b

Implementing Matrix Multiplication
Sequential Code
Assume throughout that the matrices are square (n x n matrices). The sequential code to compute A x B could simply be

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            c[i][j] = 0;
            for (k = 0; k < n; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }

This algorithm requires $n^3$ multiplications and $n^3$ additions, leading to a sequential time complexity of $O(n^3)$. Very easy to parallelize.

Implementing Matrix Multiplication
With n x n matrices, we can obtain:
Time complexity of $O(n^2)$ with n processors
  – Each instance of the inner loop is independent and can be done by a separate processor
Time complexity of $O(n)$ with $n^2$ processors
  – One element of A and B assigned to each processor
  – Cost optimal since $O(n^3) = n \times O(n^2) = n^2 \times O(n)$
Time complexity of $O(\log n)$ with $n^3$ processors
  – By parallelizing the inner loop
  – Not cost-optimal since $O(n^3) < n^3 \times O(\log n)$
$O(\log n)$ is a lower bound for parallel matrix multiplication.

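The cost-optimality statements above follow from comparing the cost of each parallel version, meaning the number of processors times the parallel time, against the sequential time $O(n^3)$. Spelled out, with $p$ denoting the processor count and $T_p$ the parallel time (notation assumed here, not taken from the slides):

```latex
\begin{align*}
p = n:   &\quad p \cdot T_p = n \cdot O(n^2) = O(n^3)
         && \text{equal to the sequential time: cost-optimal} \\
p = n^2: &\quad p \cdot T_p = n^2 \cdot O(n) = O(n^3)
         && \text{equal to the sequential time: cost-optimal} \\
p = n^3: &\quad p \cdot T_p = n^3 \cdot O(\log n) = O(n^3 \log n)
         && \text{larger by a factor of } \log n \text{: not cost-optimal}
\end{align*}
```

The third variant buys its speed with redundant processors: most of them sit idle during the later levels of the summation tree.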
Block Matrix Multiplication
Partitioning into submatrices

Matrix Multiplication

Performance Improvement
Using tree construction, n numbers can be added in $O(\log n)$ steps (using $n^3$ processors for the full matrix product).

OpenMP: Flowchart for Matrix Multiplication
  Initialize variables & matrices
  Initialize OpenMP Environment
  Compute the matrix product for the local workload (each thread, in parallel)
  Print Results
Schedule and workload chunk size are determined based on user preferences during compile/run time.
Since each thread works on its own portion of the array and updates different parts of the same array, synchronization is not needed.

OpenMP Matrix Multiplication

    #include <stdio.h>
    #include <omp.h>

    /* Main Program */
    int main(void)
    {
        int NoofRows_A, NoofCols_A, NoofRows_B, NoofCols_B, i, j, k;

        NoofRows_A = NoofCols_A = NoofRows_B = NoofCols_B = 4;
        float Matrix_A[NoofRows_A][NoofCols_A];
        float Matrix_B[NoofRows_B][NoofCols_B];
        float Result[NoofRows_A][NoofCols_B];

        /* Matrix_A Elements */
        for (i = 0; i < NoofRows_A; i++) {
            for (j = 0; j < NoofCols_A; j++)
                Matrix_A[i][j] = i + j;
        }
        /* Matrix_B Elements */
        for (i = 0; i < NoofRows_B; i++) {
            for (j = 0; j < NoofCols_B; j++)
                Matrix_B[i][j] = i + j;
        }
        printf("The Matrix_A Is \n");

Initialize the two matrices A[][] and B[][] with the sum of their index values.

OpenMP Matrix Multiplication

        for (i = 0; i < NoofRows_A; i++) {
            for (j = 0; j < NoofCols_A; j++)
                printf("%f \t", Matrix_A[i][j]);
            printf("\n");
        }
        printf("The Matrix_B Is \n");
        for (i = 0; i < NoofRows_B; i++) {
            for (j = 0; j < NoofCols_B; j++)
                printf("%f \t", Matrix_B[i][j]);
            printf("\n");
        }
        for (i = 0; i < NoofRows_A; i++) {
            for (j = 0; j < NoofCols_B; j++) {
                Result[i][j] = 0.0;
            }
        }

        /* rows of Result are divided among the threads; j and k must be private */
    #pragma omp parallel for private(j,k)
        for (i = 0; i < NoofRows_A; i = i + 1)
            for (j = 0; j < NoofCols_B; j = j + 1)
                for (k = 0; k < NoofCols_A; k = k + 1)
                    Result[i][j] = Result[i][j] + Matrix_A[i][k] * Matrix_B[k][j];

        printf("\nThe Matrix Computation Result Is \n");

Print the matrices for debugging purposes.
Initialize the result matrix with 0.0.
Using the OpenMP parallel for directive: calculate the product of the two matrices.
Load balancing is done based on the values of the OpenMP environment variables and the number of threads.

OpenMP Matrix Multiplication

        for (i = 0; i < NoofRows_A; i = i + 1) {
            for (j = 0; j < NoofCols_B; j = j + 1)
                printf("%f ", Result[i][j]);
            printf("\n");
        }
        return 0;
    }

Demo: OpenMP Matrix Multiplication
    l13]$ ./omp_mm
    The Matrix_A Is
    The Matrix_B Is
    The Matrix Computation Result Is
    l13]$

Flowchart for MPI Matrix Multiplication
"master":
  Initialize MPI Environment
  Initialize Array
  Partition Array into workloads
  Send Workload to "workers"
  Wait for "workers" to finish task
  Recv. results
  Print results
  End
"workers":
  Initialize MPI Environment
  Recv. work
  Calculate matrix product
  Send result
  End

Matrix Multiplication (source code)

    #include "mpi.h"
    #include <stdio.h>
    #include <stdlib.h>

    #define NRA 4            /* number of rows in matrix A */
    #define NCA 4            /* number of columns in matrix A */
    #define NCB 4            /* number of columns in matrix B */
    #define MASTER 0         /* taskid of first task */
    #define FROM_MASTER 1    /* setting a message type */
    #define FROM_WORKER 2    /* setting a message type */

    int main(int argc, char *argv[])
    {
        int numtasks,        /* number of tasks in partition */
            taskid,          /* a task identifier */
            numworkers,      /* number of worker tasks */
            source,          /* task id of message source */
            dest,            /* task id of message destination */
            mtype,           /* message type */
            rows,            /* rows of matrix A sent to each worker */
            averow, extra, offset,   /* used to determine rows sent to each worker */
            i, j, k, rc;     /* misc */
        double a[NRA][NCA],  /* matrix A to be multiplied */
               b[NCA][NCB],  /* matrix B to be multiplied */
               c[NRA][NCB];  /* result matrix C */
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

Initialize the MPI environment.
Source : tutorials/mpi/samples/C/mpi_mm.c

Matrix Multiplication (source code)

        if (numtasks < 2) {
            printf("Need at least two MPI tasks. Quitting...\n");
            MPI_Abort(MPI_COMM_WORLD, 1);   /* pass a defined error code */
            exit(1);
        }
        numworkers = numtasks - 1;

        if (taskid == MASTER) {
            /* MASTER: initialize the matrices A and B.
               One loop nest fills both only because NRA == NCA == NCB. */
            for (i = 0; i < NRA; i++)
                for (j = 0; j < NCA; j++) {
                    a[i][j] = i + j + 1;
                    b[i][j] = i + j + 1;
                }

            /* Print the two matrices for debugging purposes */
            printf("Matrix A :: \n");
            for (i = 0; i < NRA; i++) {
                printf("\n");
                for (j = 0; j < NCA; j++)
                    printf("%6.2f ", a[i][j]);
            }
            printf("Matrix B :: \n");
            for (i = 0; i < NCA; i++) {
                printf("\n");
                for (j = 0; j < NCB; j++)
                    printf("%6.2f ", b[i][j]);
            }

            averow = NRA / numworkers;   /* rows processed by every worker */
            extra = NRA % numworkers;    /* leftover rows: the first `extra`
                                            workers each take one more row */
            offset = 0;
            mtype = FROM_MASTER;

Source : tutorials/mpi/samples/C/mpi_mm.c
Matrix Multiplication (source code)

            /* MASTER: send one workload chunk to each worker:
               start point, number of rows to process, and sub-arrays to process */
            for (dest = 1; dest <= numworkers; dest++) {
                rows = (dest <= extra) ? averow + 1 : averow;
                printf("Sending %d rows to task %d offset=%d\n", rows, dest, offset);
                MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
                MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
                MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
                MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
                offset = offset + rows;
            }

            /* MASTER: receive the computed results from the workers */
            mtype = FROM_WORKER;   /* message tag for messages sent by "workers" */
            for (i = 1; i <= numworkers; i++) {
                source = i;
                /* offset stores the (processing) starting point of the work chunk */
                MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
                MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
                /* c[][] collects the rows of the product computed by each worker */
                MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
                printf("Received results from task %d\n", source);
            }

            printf("******************************************************\n");
            printf("Result Matrix:\n");
            for (i = 0; i < NRA; i++) {
                printf("\n");
                for (j = 0; j < NCB; j++)
                    printf("%6.2f ", c[i][j]);
            }
            printf("\n******************************************************\n");
            printf("Done.\n");
        }

Source : tutorials/mpi/samples/C/mpi_mm.c
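The master's row split can also be written in closed form, which makes it easy to check which rows a given worker receives. The following is a minimal sketch, assuming a hypothetical helper `worker_block` that is not part of mpi_mm.c. It reproduces the `averow`/`extra` rule from the send loop, with workers numbered from 1 as on the slides.

```c
/* Hypothetical helper (not in mpi_mm.c): compute the block of rows that
 * worker `dest` (1-based) receives when `nra` rows are split among
 * `numworkers` workers using the averow/extra rule of the send loop. */
void worker_block(int nra, int numworkers, int dest, int *offset, int *rows)
{
    int averow = nra / numworkers;   /* rows every worker gets */
    int extra  = nra % numworkers;   /* leftover rows */

    /* The first `extra` workers take one additional row each. */
    *rows = (dest <= extra) ? averow + 1 : averow;

    /* Rows handed out before this worker: averow per earlier worker,
     * plus one for each earlier worker that carried an extra row. */
    int before = dest - 1;
    int extras_before = (before < extra) ? before : extra;
    *offset = before * averow + extras_before;
}
```

With NRA = 4 and three workers this gives 2 rows at offset 0 for worker 1, then 1 row each at offsets 2 and 3. The block sizes differ by at most one row, so the static distribution is as even as a row-wise split allows.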
Matrix Multiplication (source code)

        /**************************** worker task ************************************/
        if (taskid > MASTER) {
            /* WORKER: receive the workload to be processed */
            mtype = FROM_MASTER;
            MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
            MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
            MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
            MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

            /* Calculate the matrix product and store the result in c[][] */
            for (k = 0; k < NCB; k++)
                for (i = 0; i < rows; i++) {
                    c[i][k] = 0.0;
                    for (j = 0; j < NCA; j++)
                        c[i][k] = c[i][k] + a[i][j] * b[j][k];
                }

            /* Send the computed results array to the master */
            mtype = FROM_WORKER;
            MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
            MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
            MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
        }
        MPI_Finalize();
    }

Source : puting/tutorials/mpi/samples/C/mpi_mm.c
Demo : Matrix Multiplication

    matrix_multiplication]$ mpirun -np 4 -machinefile ~/hosts ./mpi_mm
    mpi_mm has started with 4 tasks.
    Initializing arrays...
    Matrix A ::
    Matrix B ::
    Sending 2 rows to task 1 offset=0
    Sending 1 rows to task 2 offset=2
    Sending 1 rows to task 3 offset=3
    Received results from task 1
    Received results from task 2
    Received results from task 3
    Result Matrix:
    matrix_multiplication]$
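The block-row decomposition can be checked without an MPI installation by replaying it serially. The sketch below is an illustration under stated assumptions, not the lecture's code: `init_matrices`, `multiply_block`, and `blockwise_product` are hypothetical names, and the message exchange is replaced by a plain loop over the worker numbers. It uses the same initialization (`a[i][j] = b[i][j] = i+j+1`) and the same worker loop nest as the program above.

```c
#define NRA 4   /* rows of A */
#define NCA 4   /* columns of A, rows of B */
#define NCB 4   /* columns of B */

/* Fill A and B the way the master does: a[i][j] = b[i][j] = i + j + 1.
 * Sharing one index range is valid only because all dimensions are equal. */
void init_matrices(double a[NRA][NCA], double b[NCA][NCB])
{
    for (int i = 0; i < NRA; i++)
        for (int j = 0; j < NCA; j++) {
            a[i][j] = i + j + 1;
            b[i][j] = i + j + 1;
        }
}

/* Worker loop nest, applied to rows offset .. offset+rows-1 of C. */
void multiply_block(int offset, int rows, double a[NRA][NCA],
                    double b[NCA][NCB], double c[NRA][NCB])
{
    for (int k = 0; k < NCB; k++)
        for (int i = offset; i < offset + rows; i++) {
            c[i][k] = 0.0;
            for (int j = 0; j < NCA; j++)
                c[i][k] += a[i][j] * b[j][k];
        }
}

/* Serial stand-in for the master/worker exchange: each "worker" block
 * is computed in turn instead of being sent over MPI. */
void blockwise_product(int numworkers, double c[NRA][NCB])
{
    double a[NRA][NCA], b[NCA][NCB];
    init_matrices(a, b);

    int averow = NRA / numworkers;
    int extra = NRA % numworkers;
    int offset = 0;
    for (int dest = 1; dest <= numworkers; dest++) {
        int rows = (dest <= extra) ? averow + 1 : averow;
        multiply_block(offset, rows, a, b, c);
        offset += rows;
    }
}
```

For these inputs the top-left entry of the product is 1·1 + 2·2 + 3·3 + 4·4 = 30. The result does not depend on the number of workers, because every row of C is computed exactly once from the same A and B.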