
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 1 Prof. Thomas Sterling, Dr. Hartmut Kaiser, Department of Computer Science, Louisiana State University, March 10th, 2011

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Dr. Hartmut Kaiser, Center for Computation & Technology, R315 Johnston

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Puzzle of the Day What’s the difference between the following valid C function declarations: void foo(); void foo(void); void foo(…);

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Puzzle of the Day What's the difference between the following valid C function declarations: void foo(); → any number of parameters; void foo(void); → no parameters; void foo(…); → any number of parameters. What's the difference between the following valid C++ function declarations: void foo(); void foo(void); void foo(…);

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Puzzle of the Day What's the difference between the following valid C function declarations: void foo(); → any number of parameters; void foo(void); → no parameters; void foo(…); → any number of parameters. What's the difference between the following valid C++ function declarations: void foo(); → no parameters; void foo(void); → no parameters; void foo(…); → any number of parameters.
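To see the difference concretely, consider the following hypothetical translation unit (illustrative only, not from the lecture):

void foo();        /* C: no prototype, so arguments are unchecked; C++: takes no arguments */

int main(void)
{
    foo(1, 2, 3);  /* accepted by a C compiler (there is no prototype to check against);
                      rejected by a C++ compiler, where foo() declares zero parameters */
    return 0;
}

void foo() { }     /* old-style definition with no parameters */

In C, passing arguments to a function whose definition takes none is not portable, which is one reason C programmers are encouraged to write void foo(void) explicitly.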

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Topics: Introduction; Mandelbrot Sets; Monte Carlo: Pi Calculation; Vector Dot-Product; Matrix Multiplication

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Topics: Introduction; Mandelbrot Sets; Monte Carlo: Pi Calculation; Vector Dot-Product; Matrix Multiplication

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Parallel Programming
Goals:
– Correctness
– Reduction in execution time
– Efficiency
– Scalability
– Increased problem size and richness of models
Objectives:
– Expose parallelism (algorithm design)
– Distribute work uniformly (data decomposition and allocation; dynamic load balancing)
– Minimize overhead of synchronization and communication (coarse granularity; big messages)
– Minimize redundant work (though some redundancy is sometimes better than communication)

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Basic Parallel (MPI) Program Steps
– Establish logical bindings
– Initialize application execution environment
– Distribute data and work
– Perform core computations in parallel (across nodes)
– Synchronize and exchange intermediate data results (optional for non-embarrassingly-parallel, i.e., cooperative, computations)
– Detect the "stop" condition (may be implicit, e.g., with a barrier)
– Aggregate final results (often a reduction operation)
– Output results and error code
– Terminate and return to OS
A minimal skeleton realizing these steps is sketched below.
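A minimal sketch of these steps in MPI (the per-rank local value and the final sum are placeholders, not lecture code):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);               /* initialize execution environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* establish logical bindings */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double)rank;          /* stand-in for distributed work */
    double total = 0.0;

    /* core computation would run here, in parallel across ranks */

    /* aggregate final results with a reduction */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)                        /* output results */
        printf("total = %f\n", total);

    MPI_Finalize();                       /* terminate and return to OS */
    return 0;
}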

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 "Embarrassingly Parallel"
A common phrase: poorly defined but widely used. It suggests lots and lots of parallelism with essentially no inter-task communication or coordination: a highly partitionable workload with minimal overhead.
"Almost embarrassingly parallel": the same, except that a master is required to launch the many tasks and to collect their final results; sometimes still referred to as "embarrassingly parallel".

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Topics: Introduction; Mandelbrot Sets; Monte Carlo: Pi Calculation; Vector Dot-Product; Matrix Multiplication

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Mandelbrot Set (figure). Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, Pearson Education Inc. All rights reserved.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Mandelbrot Set. (Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, Pearson Education Inc. All rights reserved.) The set of points in the complex plane that are quasi-stable (the iterates increase and decrease but do not exceed some limit) when computed by iterating the function z_{k+1} = z_k^2 + c, where z_{k+1} is the (k+1)-th iterate of the complex number z = a + bi and c is a complex number giving the position of the point in the complex plane. The initial value for z is zero. Iteration continues until the magnitude of z is greater than 2 or the number of iterations reaches an arbitrary limit. The magnitude of z is the length of the vector: |z| = sqrt(a^2 + b^2).

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 (Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, Pearson Education Inc. All rights reserved.) Sequential routine computing the value of one point, returning the number of iterations:

struct complex {
  float real;
  float imag;
};

int cal_pixel(struct complex c)
{
  int count, max;
  struct complex z;
  float temp, lengthsq;
  max = 256;
  z.real = 0;
  z.imag = 0;
  count = 0; /* number of iterations */
  do {
    temp = z.real * z.real - z.imag * z.imag + c.real;
    z.imag = 2 * z.real * z.imag + c.imag;
    z.real = temp;
    lengthsq = z.real * z.real + z.imag * z.imag;
    count++;
  } while ((lengthsq < 4.0) && (count < max));
  return count;
}

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Parallelizing Mandelbrot Set Computation
Static task assignment: simply divide the region into a fixed number of parts, each computed by a separate processor. Not very successful, because different regions require different numbers of iterations and hence different amounts of time.
Dynamic task assignment: have processors request new regions after computing their previous ones (see the work-pool sketch after the next slide).
(Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, Pearson Education Inc. All rights reserved.)

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Dynamic Task Assignment: Work Pool / Processor Farms (figure). (Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, Pearson Education Inc. All rights reserved.)
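A hedged sketch of the master side of such a work pool (the tag values, the encoding of a region as a bare index, and the helper name master_work_pool are illustrative assumptions, not the lecture's code):

#include <mpi.h>

#define WORK_TAG   1   /* illustrative tag values */
#define STOP_TAG   2
#define RESULT_TAG 3

/* Master loop of a work pool: hand out one region index per request.
   Sketch only -- region payloads and result collection are elided. */
void master_work_pool(int nprocs, int nregions)
{
    int next = 0, dummy, r;
    MPI_Status st;

    /* seed every worker with an initial region */
    for (r = 1; r < nprocs && next < nregions; r++, next++)
        MPI_Send(&next, 1, MPI_INT, r, WORK_TAG, MPI_COMM_WORLD);

    /* whenever a worker reports back, give it the next region */
    while (next < nregions) {
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, RESULT_TAG,
                 MPI_COMM_WORLD, &st);
        MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, WORK_TAG, MPI_COMM_WORLD);
        next++;
    }

    /* no regions left: tell every worker to stop */
    for (r = 1; r < nprocs; r++)
        MPI_Send(&next, 1, MPI_INT, r, STOP_TAG, MPI_COMM_WORLD);
}

A complete version would also drain the results still in flight before sending the stop messages; the point here is only that fast workers automatically receive more regions than slow ones.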

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Flowchart for Mandelbrot Set Generation
"Master": initialize MPI environment; create local workload buffer; isolate work region; calculate Mandelbrot set values across the work region; write the result from task 0 to file; receive results from the "workers"; concatenate results to the file; end.
"Workers": initialize MPI environment; create local workload buffer; isolate work region; calculate Mandelbrot set values across the work region; send the result to the "master".

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Mandelbrot Sets (source code)

#include <stdio.h>   /* header names were lost in extraction; stdio.h, stdlib.h and mpi.h assumed */
#include <stdlib.h>
#include <mpi.h>

typedef struct complex {
  double real;
  double imag;
} Complex;

int cal_pixel(Complex c)
{
  int count, max_iter;
  Complex z;
  double temp, lengthsq;
  max_iter = 256;
  z.real = 0;
  z.imag = 0;
  count = 0;
  do {
    temp = z.real * z.real - z.imag * z.imag + c.real;
    z.imag = 2 * z.real * z.imag + c.imag;
    z.real = temp;
    lengthsq = z.real * z.real + z.imag * z.imag;
    count++;
  } while ((lengthsq < 4.0) && (count < max_iter));
  return (count);
}

cal_pixel() runs on every worker process and calculates the iteration count for every pixel.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Mandelbrot Sets (source code)

#define MASTERPE 0

int main(int argc, char **argv)
{
  FILE *file;
  int i, j;
  int tmp;
  Complex c;
  double *data_l, *data_l_tmp;
  int nx, ny;
  int mystrt, myend;
  int nrows_l;
  int nprocs, mype;
  MPI_Status status;

  /***** Initializing MPI Environment *****/
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &mype);

  /***** Pass in the dimensions (X, Y) of the area to cover *****/
  if (argc != 3) {
    int err = 0;
    printf("argc %d\n", argc);
    if (mype == MASTERPE) {
      printf("usage: mandelbrot nx ny");
      MPI_Abort(MPI_COMM_WORLD, err);
    }
  }

  /* get command line args */
  nx = atoi(argv[1]);
  ny = atoi(argv[2]);

Initialize the MPI environment and check that the input arguments (the x, y dimensions of the region to be processed) are passed.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Mandelbrot Sets (source code)

  /* assume nx divides equally among processes */
  nrows_l = nx / nprocs;
  mystrt = mype * nrows_l;
  myend = mystrt + nrows_l - 1;

  /* create buffer for local work only */
  data_l = (double *) malloc(nrows_l * ny * sizeof(double));
  data_l_tmp = data_l;

  /* calc each proc's coordinates and call local mandelbrot value generation function */
  for (i = mystrt; i <= myend; ++i) {
    c.real = i / ((double) nx) * 4.0 - 2.0;   /* scaling constants were lost in
                                                 extraction; a map onto [-2, 2]
                                                 is assumed here */
    for (j = 0; j < ny; ++j) {
      c.imag = j / ((double) ny) * 4.0 - 2.0; /* same assumption */
      tmp = cal_pixel(c);
      *data_l++ = (double) tmp;
    }
  }
  data_l = data_l_tmp;

Determine the dimensions of the work to be performed by each concurrent task. Each local task calculates the coordinates of every pixel in its region; for each pixel, cal_pixel() is called and the corresponding value is computed.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Mandelbrot Sets (source code)

  if (mype == MASTERPE) {
    file = fopen("mandelbrot.bin_0000", "w");
    printf("nrows_l, ny %d %d\n", nrows_l, ny);
    fwrite(data_l, nrows_l * ny, sizeof(double), file);
    fclose(file);
    for (i = 1; i < nprocs; ++i) {
      MPI_Recv(data_l, nrows_l * ny, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
      printf("received message from proc %d\n", i);
      file = fopen("mandelbrot.bin_0000", "a");
      fwrite(data_l, nrows_l * ny, sizeof(double), file);
      fclose(file);
    }
  } else {
    MPI_Send(data_l, nrows_l * ny, MPI_DOUBLE, MASTERPE, 0, MPI_COMM_WORLD);
  }
  MPI_Finalize();
}

The master process opens a file, stores its own values in it, then waits to receive the values computed by each of the worker processes. Worker processes send the computed Mandelbrot values of their region to the master.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Demo: Mandelbrot Sets

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Demo: Mandelbrot Sets

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Topics: Introduction; Mandelbrot Sets; Monte Carlo: Pi Calculation; Vector Dot-Product; Matrix Multiplication


CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Monte Carlo Simulation
Used when it is infeasible or impossible to compute an exact result with a deterministic algorithm. Especially useful for:
– Studying systems with a large number of coupled degrees of freedom: fluids, disordered materials, strongly coupled solids, cellular structures
– Modeling phenomena with significant uncertainty in inputs, e.g., the calculation of risk in business
These methods are also widely used in mathematics, e.g., the evaluation of definite integrals, particularly multidimensional integrals with complicated boundary conditions.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Monte Carlo Simulation
There is no single approach; rather, a multitude of different methods. They usually follow this pattern:
– Define a domain of possible inputs
– Generate inputs randomly from the domain
– Perform a deterministic computation using the inputs
– Aggregate the results of the individual computations into the final result
Example: calculating Pi.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Monte Carlo: Algorithm for Pi
The value of PI can be calculated in a number of ways. Consider the following method of approximating PI: inscribe a circle in a square; randomly generate points in the square; determine the number of points in the square that are also in the circle; let r be the number of points in the circle divided by the number of points in the square; then PI ~ 4r. Note that the more points generated, the better the approximation. Algorithm:

  npoints = total number of samples
  circle_count = 0
  do j = 1, npoints
    generate 2 random numbers between 0 and 1
    xcoordinate = random1; ycoordinate = random2
    if (xcoordinate, ycoordinate) lies inside the circle then
      circle_count = circle_count + 1
  end do
  PI = 4.0 * circle_count / npoints
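A direct serial C rendering of this pseudocode, as a baseline before the parallel versions below (the sample count and seed are arbitrary choices for illustration):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    long npoints = 1000000, circle_count = 0, j;  /* sample count is arbitrary */
    srand(42);                                    /* arbitrary fixed seed */
    for (j = 0; j < npoints; j++) {
        double x = (double)rand() / RAND_MAX;     /* random point in unit square */
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)                 /* inside the quarter circle? */
            circle_count++;
    }
    printf("pi ~ %f\n", 4.0 * circle_count / npoints);
    return 0;
}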


CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 OpenMP Pi Calculation (flowchart)
Master thread: initialize variables; initialize the OpenMP parallel environment. N worker threads, each: generate random X, Y; calculate Z = X^2 + Y^2; if the point lies within the circle, count++. A reduction (∑) combines the per-thread counts; the master calculates and prints the value of pi.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 OpenMP Calculating Pi

#include <stdio.h>   /* header names were lost in extraction; these four are assumed */
#include <stdlib.h>
#include <time.h>
#include <omp.h>

#define SEED 42

int main(int argc, char* argv[])
{
  int niter = 0;
  double x, y;
  int i, tid, count = 0; /* # of points in the 1st quadrant of unit circle */
  double z;
  double pi;
  time_t rawtime;
  struct tm * timeinfo;

  printf("Enter the number of iterations used to estimate pi: ");
  scanf("%d", &niter);
  time(&rawtime);
  timeinfo = localtime(&rawtime);

SEED is the seed for generating random numbers.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 OpenMP Calculating Pi

  printf("The current date/time is: %s", asctime(timeinfo));
  /* initialize random numbers */
  srand(SEED);

#pragma omp parallel for private(x,y,z,tid) reduction(+:count)
  for (i = 0; i < niter; i++) {
    x = (double)rand() / RAND_MAX;
    y = (double)rand() / RAND_MAX;
    z = (x*x + y*y);
    if (z <= 1) count++;
    if (i == (niter/6)-1) {
      tid = omp_get_thread_num();
      printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
    }
    if (i == (niter/3)-1) {
      tid = omp_get_thread_num();
      printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
    }
    if (i == (niter/2)-1) {
      tid = omp_get_thread_num();
      printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
    }

srand is used to seed the sequence produced by rand(). The OpenMP parallel for is set up with a reduction (∑) over count. Each iteration randomly generates an (x, y) point, calculates x^2 + y^2, and checks whether the point lies within the circle; if so, count is incremented.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Calculating Pi

    if (i == (2*niter/3)-1) {
      tid = omp_get_thread_num();
      printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
    }
    if (i == (5*niter/6)-1) {
      tid = omp_get_thread_num();
      printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
    }
    if (i == niter-1) {
      tid = omp_get_thread_num();
      printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
    }
  }

  time(&rawtime);
  timeinfo = localtime(&rawtime);
  printf("The current date/time is: %s", asctime(timeinfo));
  printf(" the total count is %i\n", count);
  pi = (double)count / niter * 4;
  printf("# of trials= %d, estimate of pi is %g \n", niter, pi);
  return 0;
}

Calculate PI from the aggregate count of the points that lie within the circle.
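One caveat about the code above: rand() keeps hidden global state and is not required to be thread-safe, so calling it from an OpenMP loop can serialize or skew the random stream. A minimal variant, assuming the POSIX rand_r() function is available, gives each thread its own state (the function name mc_count is a hypothetical helper, not lecture code):

#include <stdlib.h>
#include <omp.h>

/* count points falling inside the unit circle, one rand_r() state per thread */
long mc_count(long niter, unsigned int base_seed)
{
    long count = 0;
    long i;
    #pragma omp parallel reduction(+:count)
    {
        unsigned int seed = base_seed + omp_get_thread_num(); /* per-thread seed */
        #pragma omp for
        for (i = 0; i < niter; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)
                count++;
        }
    }
    return count;
}

Each thread then draws from an independent sequence; the exact estimate will vary with the thread count, but contention on shared generator state is avoided.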

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Demo: OpenMP Pi

l13]$ ./omcpi
Enter the number of iterations used to estimate pi:
The current date/time is: Tue Mar 4 05:53:
 thread 0 just did iteration the count is
 thread 1 just did iteration the count is 6514
 thread 1 just did iteration the count is
 thread 2 just did iteration the count is
 thread 3 just did iteration the count is 6445
 thread 3 just did iteration the count is
The current date/time is: Tue Mar 4 05:53:
 the total count is
# of trials= , estimate of pi is
l13]$

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Creating Custom Communicators
Communicators define groups and the access patterns among them. The default communicator is MPI_COMM_WORLD. Some algorithms demand more sophisticated control of communications to take advantage of reduction operators, so MPI permits the creation of custom communicators with MPI_Comm_create. The basic calls are sketched below.
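In isolation, the pattern used by the Monte Carlo code that follows is: take the group behind MPI_COMM_WORLD, exclude the server rank, and build a communicator over the remaining processes.

MPI_Group world_group, worker_group;
MPI_Comm workers;
int ranks[1];

MPI_Comm_group(MPI_COMM_WORLD, &world_group);  /* group behind the default comm */
ranks[0] = server;                             /* rank to leave out */
MPI_Group_excl(world_group, 1, ranks, &worker_group);
MPI_Comm_create(MPI_COMM_WORLD, worker_group, &workers);
MPI_Group_free(&worker_group);                 /* group handle no longer needed */
/* collective operations on `workers` now involve every rank except the server */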

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 MPI Monte Carlo Pi Computation (flowchart)
Server: initialize MPI environment; receive request; compute random array; send array to requestor; on the last request, finalize MPI.
Master: initialize MPI environment; broadcast error bound; send request to server; receive random array; perform computations; propagate number of points (Allreduce); output partial result; repeat until the stop condition is satisfied; print statistics; finalize MPI.
Worker: initialize MPI environment; receive error bound; send request to server; receive random array; perform computations; propagate number of points (Allreduce); repeat until the stop condition is satisfied; finalize MPI.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Monte Carlo: MPI - Pi (source code)

#include <stdio.h>   /* header names were lost in extraction; these are assumed */
#include <math.h>
#include <limits.h>  /* assumed source of INT_MAX; the original line's value was lost */
#include "mpi.h"

#define CHUNKSIZE 1000
#define REQUEST 1
#define REPLY 2

int main(int argc, char *argv[])
{
  int iter;
  int in, out, i, iters, max, ix, iy, ranks[1], done, temp;
  double x, y, Pi, error, epsilon;
  int numprocs, myid, server, totalin, totalout, workerid;
  int rands[CHUNKSIZE], request;
  MPI_Comm world, workers;
  MPI_Group world_group, worker_group;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  world = MPI_COMM_WORLD;
  MPI_Comm_size(world, &numprocs);
  MPI_Comm_rank(world, &myid);

Initialize the MPI environment.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Monte Carlo: MPI - Pi (source code)

  server = numprocs - 1;  /* last proc is server */
  if (myid == 0)
    sscanf(argv[1], "%lf", &epsilon);
  MPI_Bcast(&epsilon, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  MPI_Comm_group(world, &world_group);
  ranks[0] = server;
  MPI_Group_excl(world_group, 1, ranks, &worker_group);
  MPI_Comm_create(world, worker_group, &workers);
  MPI_Group_free(&worker_group);

  if (myid == server) {
    do {
      MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, REQUEST, world, &status);
      if (request) {
        for (i = 0; i < CHUNKSIZE; ) {
          rands[i] = random();
          if (rands[i] <= INT_MAX) i++;
        }
        /* Send random number array */
        MPI_Send(rands, CHUNKSIZE, MPI_INT, status.MPI_SOURCE, REPLY, world);
      }
    } while (request > 0);
  } else {  /* Begin worker block */
    request = 1;
    done = in = out = 0;
    max = INT_MAX;  /* max int, for normalization */
    MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
    MPI_Comm_rank(workers, &workerid);
    iter = 0;

Broadcast the error bound (epsilon) and create a custom communicator. Server process: (1) receives a request to generate random numbers, (2) computes the random number array, (3) sends the array to the requestor. Worker process: requests the server to generate a random number array.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Monte Carlo: MPI - Pi (source code)

    while (!done) {
      iter++;
      request = 1;
      /* Recv. random array from server */
      MPI_Recv(rands, CHUNKSIZE, MPI_INT, server, REPLY, world, &status);
      for (i = 0; i < CHUNKSIZE - 1; ) {
        x = (((double) rands[i++]) / max) * 2 - 1;
        y = (((double) rands[i++]) / max) * 2 - 1;
        if (x*x + y*y < 1.0) in++;
        else out++;
      }
      MPI_Allreduce(&in, &totalin, 1, MPI_INT, MPI_SUM, workers);
      MPI_Allreduce(&out, &totalout, 1, MPI_INT, MPI_SUM, workers);
      Pi = (4.0 * totalin) / (totalin + totalout);
      error = fabs(Pi - 3.141592653589793238462643); /* reference constant assumed;
                                                        it was lost in extraction */
      done = (error < epsilon);  /* stop test assumed: compare against the
                                    broadcast error bound */
      request = (done) ? 0 : 1;
      if (myid == 0) {  /* If "Master": print current value of PI */
        printf("\rpi = %23.20f", Pi);
        MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
      } else {          /* If "Worker": request new array if not finished */
        if (request)
          MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
      }
    }
    MPI_Comm_free(&workers);
  }

Worker: receive the random number array from the server; for each (x, y) pair in the array, compute the coordinates and determine whether the point lies inside or outside the circle. Compute the value of pi, check whether the error is within the threshold, print the current value of PI, and request more work.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Monte Carlo: MPI - Pi (source code)

  if (myid == 0) {  /* If "Master": print results */
    printf("\npoints: %d\nin: %d, out: %d, <return> to exit\n",
           totalin + totalout, totalin, totalout);
    getchar();
  }
  MPI_Finalize();
}

Print the final value of PI.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Demo: MPI Monte Carlo, Pi

> mpirun -np 4 monte 1e-20
pi =
points: in: , out:
> mpirun -np 4 monte 1e-20
pi =
points: in: , out:

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Topics: Introduction; Mandelbrot Sets; Monte Carlo: Pi Calculation; Vector Dot-Product; Matrix Multiplication

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Vector Dot Product
Multiplication of two vectors followed by a summation. With A = [x_1, x_2, x_3, ..., x_n] and B = [y_1, y_2, y_3, ..., y_n], the element-wise products are A[i] * B[i] = x_i * y_i, and the dot product is A · B = x_1*y_1 + x_2*y_2 + ... + x_n*y_n.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 OpenMP Dot Product using Reduction (flowchart)
Master thread: initialize variables; initialize the OpenMP parallel environment. N worker threads: calculate the local computations. A reduction (∑) combines the partial results; print the value of the dot product. The workload and schedule are determined by OpenMP at runtime.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 OpenMP Dot Product

#include <stdio.h>  /* header name was lost in extraction; stdio.h assumed */

int main()
{
  int i, n, chunk;
  float a[16], b[16], result;

  n = 16;
  chunk = 4;
  result = 0.0;
  for (i = 0; i < n; i++) {
    a[i] = i * 1.0;
    b[i] = i * 2.0;
  }

#pragma omp parallel for default(shared) private(i) \
        schedule(static,chunk) reduction(+:result)
  for (i = 0; i < n; i++)
    result = result + (a[i] * b[i]);

  printf("Final result= %f\n", result);
}

Reduction example with summation, where the result of the reduction operation stores the dot product of the two vectors, ∑ a[i]*b[i].

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Demo: Dot Product using Reduction

l12]$ ./reduction
a[i]  b[i]  a[i]*b[i]
Final result=
l12]$

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 MPI Dot Product Computation (flowchart)
Master: initialize variables; initialize MPI environment; broadcast the size of the vectors; get vector A and distribute the partitioned vector A; get vector B and distribute the partitioned vector B; calculate the dot product for the local workload; reduction (∑); print the result.
Worker: initialize variables; initialize MPI environment; receive the size of the vectors; receive the local workload for vector A; receive the local workload for vector B; calculate the dot product for the local workload; reduction (∑).

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 MPI Dot Product

#include <stdio.h>  /* header name was lost in extraction; stdio.h assumed */
#include "mpi.h"

#define MAX_LOCAL_ORDER 100

int main(int argc, char* argv[])
{
  float local_x[MAX_LOCAL_ORDER];
  float local_y[MAX_LOCAL_ORDER];
  int n;
  int n_bar;  /* = n/p */
  float dot;
  int p;
  int my_rank;
  void Read_vector(char* prompt, float local_v[], int n_bar, int p, int my_rank);
  float Parallel_dot(float local_x[], float local_y[], int n_bar);

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

  if (my_rank == 0) {
    printf("Enter the order of the vectors\n");
    scanf("%d", &n);
  }
  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

Initialize the MPI environment and broadcast the order of the vectors across the workers. (From Parallel Programming with MPI by Peter Pacheco.)

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 MPI Dot Product

  n_bar = n/p;
  Read_vector("the first vector", local_x, n_bar, p, my_rank);
  Read_vector("the second vector", local_y, n_bar, p, my_rank);
  dot = Parallel_dot(local_x, local_y, n_bar);
  if (my_rank == 0)
    printf("The dot product is %f\n", dot);
  MPI_Finalize();
} /* main */

void Read_vector(
    char* prompt    /* in  */,
    float local_v[] /* out */,
    int n_bar       /* in  */,
    int p           /* in  */,
    int my_rank     /* in  */)
{
  int i, q;

Receive and distribute the two vectors, then calculate the parallel dot product for the local workloads. Master: print the result of the dot product. (From Parallel Programming with MPI by Peter Pacheco.)

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 MPI Dot Product

  float temp[MAX_LOCAL_ORDER];
  MPI_Status status;

  if (my_rank == 0) {
    printf("Enter %s\n", prompt);
    for (i = 0; i < n_bar; i++)
      scanf("%f", &local_v[i]);
    for (q = 1; q < p; q++) {
      for (i = 0; i < n_bar; i++)
        scanf("%f", &temp[i]);
      MPI_Send(temp, n_bar, MPI_FLOAT, q, 0, MPI_COMM_WORLD);
    }
  } else {
    MPI_Recv(local_v, n_bar, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
  }
} /* Read_vector */

float Serial_dot(
    float x[] /* in */,

MASTER: get the input from the user, prepare the local workload, store the work chunks in an array, and send each chunk to the corresponding worker node for processing. Worker: receive the local workload to be processed. (From Parallel Programming with MPI by Peter Pacheco.)

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 MPI Dot Product

    float y[] /* in */,
    int n     /* in */)
{
  int i;
  float sum = 0.0;
  for (i = 0; i < n; i++)
    sum = sum + x[i]*y[i];
  return sum;
} /* Serial_dot */

float Parallel_dot(
    float local_x[] /* in */,
    float local_y[] /* in */,
    int n_bar       /* in */)
{
  float local_dot;
  float dot = 0.0;

  local_dot = Serial_dot(local_x, local_y, n_bar);
  MPI_Reduce(&local_dot, &dot, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
  return dot;
} /* Parallel_dot */

Serial_dot() calculates the dot product on the local arrays. Parallel_dot() calls Serial_dot() on the local workload, then combines the partial sums with a collective MPI_Reduce (MPI_SUM). (From Parallel Programming with MPI by Peter Pacheco.)

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Demo: MPI Dot Product

l13]$ mpirun …../mpi_dot
Enter the order of the vectors
16
Enter the first vector
Enter the second vector
The dot product is
l13]$

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Topics: Introduction; Mandelbrot Sets; Monte Carlo: Pi Calculation; Vector Dot-Product; Matrix Multiplication

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Matrix-Vector Multiplication (figure). (Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, Pearson Education Inc. All rights reserved.)

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Matrix-Vector Multiplication: c = A × b
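For reference, the serial computation c = A × b is one dot product per row (a minimal sketch; the square dimension n and the function name matvec are illustrative):

/* c[i] = sum over j of A[i][j] * b[j] : one dot product per row of A */
void matvec(int n, double A[n][n], double b[n], double c[n])
{
    int i, j;
    for (i = 0; i < n; i++) {
        c[i] = 0.0;
        for (j = 0; j < n; j++)
            c[i] += A[i][j] * b[j];
    }
}

Because each row's dot product is independent, the rows can be distributed across processors exactly as the dot-product example above distributes vector segments.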

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Implementing Matrix Multiplication: Sequential Code
Assume throughout that the matrices are square (n x n). The sequential code to compute A x B could simply be:

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
      c[i][j] = 0;
      for (k = 0; k < n; k++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }

This algorithm requires n^3 multiplications and n^3 additions, leading to a sequential time complexity of O(n^3). Very easy to parallelize. (Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, Pearson Education Inc. All rights reserved.)

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Implementing Matrix Multiplication
With n x n matrices, we can obtain:
– Time complexity of O(n^2) with n processors: each instance of the inner loop is independent and can be done by a separate processor.
– Time complexity of O(n) with n^2 processors: one element of A and B assigned to each processor. Cost-optimal, since O(n^3) = n x O(n^2) = n^2 x O(n).
– Time complexity of O(log n) with n^3 processors, by parallelizing the inner loop. Not cost-optimal, since O(n^3) < n^3 x O(log n).
O(log n) is the lower bound for parallel matrix multiplication.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Block Matrix Multiplication: partitioning into sub-matrices. (Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, Pearson Education Inc. All rights reserved.)
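A serial sketch of the block scheme (assuming, for simplicity, a block size s that divides n; the function name block_matmul is illustrative): the three outer loops step over sub-matrices, and the inner three loops perform an ordinary multiply-accumulate within each block.

/* blocked (tiled) matrix multiply; s must divide n in this sketch */
void block_matmul(int n, int s, double a[n][n], double b[n][n], double c[n][n])
{
    int p, q, r, i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            c[i][j] = 0.0;
    for (p = 0; p < n; p += s)           /* block row of C */
        for (q = 0; q < n; q += s)       /* block column of C */
            for (r = 0; r < n; r += s)   /* C_pq += A_pr * B_rq */
                for (i = p; i < p + s; i++)
                    for (j = q; j < q + s; j++)
                        for (k = r; k < r + s; k++)
                            c[i][j] += a[i][k] * b[k][j];
}

Besides mapping naturally onto a grid of processors (one sub-matrix product per processor), blocking also improves cache reuse, since each s x s tile is touched repeatedly while it is resident.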

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Matrix Multiplication (figure). (Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, Pearson Education Inc. All rights reserved.)

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Performance Improvement
Using tree construction, n numbers can be added in O(log n) steps (using n^3 processors). (Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, Pearson Education Inc. All rights reserved.)
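The same tree idea in serial form (a sketch; the function name tree_sum is illustrative): pairwise partial sums halve the number of live values each sweep, so n values reduce in about log2(n) sweeps. With one processor per pair, each sweep is a single parallel step.

/* pairwise tree reduction over x[0..n-1]; the total ends up in x[0] */
void tree_sum(double x[], int n)
{
    int stride, i;
    for (stride = 1; stride < n; stride *= 2)         /* ~log2(n) sweeps */
        for (i = 0; i + stride < n; i += 2 * stride)  /* independent pairs */
            x[i] += x[i + stride];
}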

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 OpenMP: Flowchart for Matrix Multiplication
Initialize variables & matrices; initialize the OpenMP environment; each thread computes the matrix product for its local workload; print the results. The schedule and workload chunk size are determined by user preferences at compile/run time. Since each thread works on its own portion of the arrays and updates a different part of the result array, synchronization is not needed.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 OpenMP Matrix Multiplication

#include <stdio.h>  /* header name was lost in extraction; stdio.h assumed */

/* Main Program */
int main()
{
  int NoofRows_A, NoofCols_A, NoofRows_B, NoofCols_B, i, j, k;

  NoofRows_A = NoofCols_A = NoofRows_B = NoofCols_B = 4;
  float Matrix_A[NoofRows_A][NoofCols_A];
  float Matrix_B[NoofRows_B][NoofCols_B];
  float Result[NoofRows_A][NoofCols_B];

  /* Matrix_A elements */
  for (i = 0; i < NoofRows_A; i++) {
    for (j = 0; j < NoofCols_A; j++)
      Matrix_A[i][j] = i + j;
  }
  /* Matrix_B elements */
  for (i = 0; i < NoofRows_B; i++) {
    for (j = 0; j < NoofCols_B; j++)
      Matrix_B[i][j] = i + j;
  }
  printf("The Matrix_A Is \n");

Initialize the two matrices A[][] & B[][] with the sum of their index values.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 OpenMP Matrix Multiplication

  for (i = 0; i < NoofRows_A; i++) {
    for (j = 0; j < NoofCols_A; j++)
      printf("%f \t", Matrix_A[i][j]);
    printf("\n");
  }
  printf("The Matrix_B Is \n");
  for (i = 0; i < NoofRows_B; i++) {
    for (j = 0; j < NoofCols_B; j++)
      printf("%f \t", Matrix_B[i][j]);
    printf("\n");
  }
  for (i = 0; i < NoofRows_A; i++) {
    for (j = 0; j < NoofCols_B; j++) {
      Result[i][j] = 0.0;
    }
  }
#pragma omp parallel for private(j,k)
  for (i = 0; i < NoofRows_A; i = i + 1)
    for (j = 0; j < NoofCols_B; j = j + 1)
      for (k = 0; k < NoofCols_A; k = k + 1)
        Result[i][j] = Result[i][j] + Matrix_A[i][k] * Matrix_B[k][j];
  printf("\nThe Matrix Computation Result Is \n");

Initialize the result matrix with 0.0 and print the input matrices for debugging purposes. Using the OpenMP parallel for directive, calculate the product of the two matrices; load balancing is done based on the values of the OpenMP environment variables and the number of threads.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 OpenMP Matrix Multiplication

  for (i = 0; i < NoofRows_A; i = i + 1) {
    for (j = 0; j < NoofCols_B; j = j + 1)
      printf("%f ", Result[i][j]);
    printf("\n");
  }
}

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Demo: OpenMP Matrix Multiplication

l13]$ ./omp_mm
The Matrix_A Is
The Matrix_B Is
The Matrix Computation Result Is
l13]$

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Flowchart for MPI Matrix Multiplication
"Master": initialize MPI environment; initialize array; partition array into workloads; send workloads to "workers"; wait for "workers" to finish; receive results; print results; end.
"Workers": initialize MPI environment; receive work; calculate the matrix product; send the result.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Matrix Multiplication (source code)

#include "mpi.h"
#include <stdio.h>  /* header names were lost in extraction; stdio.h and stdlib.h assumed */
#include <stdlib.h>

#define NRA 4          /* number of rows in matrix A */
#define NCA 4          /* number of columns in matrix A */
#define NCB 4          /* number of columns in matrix B */
#define MASTER 0       /* taskid of first task */
#define FROM_MASTER 1  /* setting a message type */
#define FROM_WORKER 2  /* setting a message type */

int main(int argc, char *argv[])
{
  int numtasks,        /* number of tasks in partition */
      taskid,          /* a task identifier */
      numworkers,      /* number of worker tasks */
      source,          /* task id of message source */
      dest,            /* task id of message destination */
      mtype,           /* message type */
      rows,            /* rows of matrix A sent to each worker */
      averow, extra, offset, /* used to determine rows sent to each worker */
      i, j, k, rc;     /* misc */
  double a[NRA][NCA],  /* matrix A to be multiplied */
         b[NCA][NCB],  /* matrix B to be multiplied */
         c[NRA][NCB];  /* result matrix C */
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

Initialize the MPI environment. Source: tutorials/mpi/samples/C/mpi_mm.c

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Matrix Multiplication (source code)

  if (numtasks < 2) {
    printf("Need at least two MPI tasks. Quitting...\n");
    MPI_Abort(MPI_COMM_WORLD, rc);
    exit(1);
  }
  numworkers = numtasks - 1;

  if (taskid == MASTER) {
    for (i = 0; i < NRA; i++)
      for (j = 0; j < NCA; j++) {
        a[i][j] = i + j + 1;
        b[i][j] = i + j + 1;
      }
    printf("Matrix A :: \n");
    for (i = 0; i < NRA; i++) {
      printf("\n");
      for (j = 0; j < NCB; j++)
        printf("%6.2f ", a[i][j]);
    }
    printf("Matrix B :: \n");
    for (i = 0; i < NRA; i++) {
      printf("\n");
      for (j = 0; j < NCB; j++)
        printf("%6.2f ", b[i][j]);
    }
    averow = NRA / numworkers;
    extra = NRA % numworkers;
    offset = 0;
    mtype = FROM_MASTER;

MASTER: initialize the matrices A & B and print them for debugging purposes. Calculate the number of rows to be processed by each worker, and the number of overflow rows to be distributed one each among the first workers.

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Matrix Multiplication (source code)

    /* To each worker send: start point, number of rows to process, and sub-arrays to process */
    for (dest = 1; dest <= numworkers; dest++) {
      rows = (dest <= extra) ? averow + 1 : averow;
      printf("Sending %d rows to task %d offset=%d\n", rows, dest, offset);
      MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
      MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
      MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
      MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
      offset = offset + rows;
    }

    /* Receive results from worker tasks */
    mtype = FROM_WORKER;  /* message tag for messages sent by "workers" */
    for (i = 1; i <= numworkers; i++) {
      source = i;
      /* offset stores the (processing) starting point of the work chunk */
      MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
      printf("Received results from task %d\n", source);
    }

    printf("******************************************************\n");
    printf("Result Matrix:\n");
    for (i = 0; i < NRA; i++) {
      printf("\n");
      for (j = 0; j < NCB; j++)
        printf("%6.2f ", c[i][j]);
    }
    printf("\n******************************************************\n");
    printf("Done.\n");
  }

MASTER: send the workload chunks to each worker, then receive the computed chunks back; c[][] contains the matrix products calculated for each workload chunk by the corresponding worker. Source: tutorials/mpi/samples/C/mpi_mm.c

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Matrix Multiplication (source code)

  /**************************** worker task ************************************/
  if (taskid > MASTER) {
    mtype = FROM_MASTER;
    MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
    MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
    MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
    MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

    for (k = 0; k < NCB; k++)
      for (i = 0; i < rows; i++) {
        c[i][k] = 0.0;
        for (j = 0; j < NCA; j++)
          /* Calculate the product and store the result in C */
          c[i][k] = c[i][k] + a[i][j] * b[j][k];
      }

    mtype = FROM_WORKER;
    MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
    MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
    /* Worker sends the resultant array to the master */
    MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
  }
  MPI_Finalize();
}

WORKER: receive the workload to be processed, calculate the matrix product, store the result in c[][], and send the computed results back to the master. Source: tutorials/mpi/samples/C/mpi_mm.c

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011 Demo: Matrix Multiplication

matrix_multiplication]$ mpirun -np 4 -machinefile ~/hosts ./mpi_mm
mpi_mm has started with 4 tasks.
Initializing arrays...
Matrix A ::
Matrix B ::
Sending 2 rows to task 1 offset=0
Sending 1 rows to task 2 offset=2
Sending 1 rows to task 3 offset=3
Received results from task 1
Received results from task 2
Received results from task 3
Result Matrix:
matrix_multiplication]$
