Other Means of Executing Parallel Programs: OpenMP and Paraguin
(c) 2011 Clayton S. Ferner

OpenMP: Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.

The Paraguin Compiler: The Paraguin compiler is a compiler written by me (no group, no funding, just me by myself) at UNCW. The intent is to create an abstraction similar to OpenMP, but for use on a distributed-memory system.


OpenMP: MPI is a message-passing interface that provides a means to implement parallel algorithms on distributed-memory systems (such as clusters). The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix platforms and Windows NT platforms.* (*The OpenMP® API specification for parallel programming.)

OpenMP (cont.): Parallelization is directed by the programmer through the use of pragmas. Pragmas are used to pass information to the compiler, but are ignored (like comments) if the compiler does not recognize them. Pragmas can therefore be inserted for a particular compiler without "breaking" the code for other compilers.
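
As an illustration of this point (a minimal sketch added here, not from the original slides): an OpenMP-compliant compiler defines the _OPENMP macro, so calls into the OpenMP runtime can be guarded and the same file still compiles with a compiler that ignores the pragmas.

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>        /* only available when the compiler supports OpenMP */
#endif

int main(void)
{
    #pragma omp parallel   /* ignored by compilers that do not know OpenMP */
    {
        int id = 0;
#ifdef _OPENMP
        id = omp_get_thread_num();
#endif
        printf("Hello from thread %d\n", id);
    }
    return 0;
}

With a non-OpenMP compiler the block runs once with id 0; with OpenMP it runs once per thread.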

OpenMP Pragmas:
    #pragma omp parallel
    structured-block
The block will be executed in parallel by all threads.

OpenMP Pragmas:
    #pragma omp for
    for-loop
The loop will be executed in parallel by all threads. The iterations are divided into "chunks", which the threads execute (although the programmer can control this). There is a barrier at the end of the for loop (i.e., threads synchronize at the end).
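
For example (a fragment added here for illustration; a, b, i, and n are assumed to be declared as in the later examples), the standard schedule clause is one way the programmer can control the chunking:

/* Hand out iterations in fixed chunks of 4 per thread instead of the default. */
#pragma omp parallel for schedule(static, 4)
for (i = 0; i < n; i++)
    b[i] = a[i] * 2.0f;    /* stand-in loop body */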

OpenMP Pragmas:
    #pragma omp parallel for
    for-loop
Equivalent to doing:
    #pragma omp parallel
    #pragma omp for
    for-loop

OpenMP Pragmas:
    #pragma omp critical
    structured-block
Defines a critical section: only one thread may be executing the block at any given time.
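
A minimal sketch of protecting an update to a shared variable (added here for illustration; the function and variable names are hypothetical):

double sum_f(int n)
{
    double sum = 0.0;
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        double v = i * 0.5;       /* stand-in for some per-iteration work */
        #pragma omp critical
        sum += v;                 /* only one thread updates sum at a time */
    }
    return sum;
}

(In practice a reduction(+:sum) clause would usually be preferred; the critical section is shown only to illustrate the directive.)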

OpenMP Pragmas:
    #pragma omp barrier
All threads will wait at the barrier until all other threads have reached the same barrier.

OpenMP Examples:

void simple(int n, float *a, float *b)
{
    int i;
    #pragma omp parallel for
    for (i = 1; i < n; i++)    /* i is private by default; start at 1 so a[i-1] stays in bounds */
        b[i] = (a[i] + a[i-1]) / 2.0;
}

OpenMP Examples: Assume n = 15 and 4 threads. The iterations i = 1 ... 14 are divided into chunks:

Thread 0: b[1], b[2], b[3], b[4]
Thread 1: b[5], b[6], b[7], b[8]
Thread 2: b[9], b[10], b[11], b[12]
Thread 3: b[13], b[14]

OpenMP Examples:

#include <stdio.h>
#include <omp.h>

int main()
{
    int x = 2;
    #pragma omp parallel num_threads(2) shared(x)
    {
        if (omp_get_thread_num() == 0)
            x = 5;
        else
            /* Print 1: the following read of x has a race */
            printf("1: Thread# %d: x = %d\n", omp_get_thread_num(), x);

        #pragma omp barrier

        if (omp_get_thread_num() == 0)
            printf("2: Thread# %d: x = %d\n", omp_get_thread_num(), x);
        else
            printf("3: Thread# %d: x = %d\n", omp_get_thread_num(), x);
    }
    return 0;
}

OpenMP Examples: output of a sample run:

$ ./test
1: Thread# 3: x = 2
1: Thread# 2: x = 5
1: Thread# 1: x = 5
3: Thread# 2: x = 5
3: Thread# 1: x = 5
2: Thread# 0: x = 5
3: Thread# 3: x = 5
$


Paraguin Compiler: The Paraguin compiler is a parallelizing compiler that produces parallel code using MPI to run on a distributed-memory system (cluster). It is based on the SUIF Compiler System (suif.stanford.edu).

Pragma Directives: As with OpenMP, the compiler is directed through the use of pragma statements. The goal is to create an abstraction similar to OpenMP, but on a distributed-memory system.

Parallel Region: A parallel region is defined with
    #pragma paraguin begin_parallel
    #pragma paraguin end_parallel
Statements between the begin and end of the parallel region are executed by all processors. Statements outside the parallel region are executed by the master process only (pid 0).

Hello World:

int __guin_mypid = 0;

int main(int argc, char *argv[])
{
    char hostname[256];
    printf("Master process %d starting.\n", __guin_mypid);
    ;
#pragma paraguin begin_parallel
    gethostname(hostname, 255);
    printf("Hello world from process %3d on machine %s.\n",
           __guin_mypid, hostname);
    ;
#pragma paraguin end_parallel
    printf("Goodbye world from process %d.\n", __guin_mypid);
}

Hello World Results:

Compiling:
$ runparaguin hello.c
Processing file hello.spd
Parallelizing procedure: "main"

Running:
$ mpirun -nolocal -np 8 hello.out
Hello world from process 3.
Hello world from process 1.
Hello world from process 7.
Hello world from process 5.
Hello world from process 4.
Hello world from process 2.
Hello world from process 6.
Master process 0 starting.
Hello world from process 0.
Goodbye world from process 0.

Hello World (cont.): Notice the semicolons in front of the pragma statements. SUIF attaches a pragma to the most recently seen statement, which may be nested. In order to attach the pragma to a top-level statement, we introduce an empty statement (';') to which it can be attached.

Paraguin Predefined Variables: Notice the declaration and initialization of the variable __guin_mypid. The predefined variables of Paraguin may be declared, initialized, and referenced by the user program, but they should not be modified beyond initialization. This is useful because it allows the same program to be compiled with gcc.

Paraguin Predefined Variables:

Identifier        Type        Description
__guin_NP         int         Number of processors
__guin_blksz      int         Block size (number of partitions per processor)
__guin_mypid      int         Current processor ID
__guin_pidr       int         Receiving thread's processor ID
__guin_pidw       int         Sending thread's processor ID
__guin_buffer     char []     Buffer of data to be transmitted
__guin_position   int         Number of bytes in the buffer
__guin_status     MPI_Status  Status of the message
__guin_p          int         Current partition number
__guin_pr         int         Receiving partition number
__guin_pw         int         Sending partition number

Parallel for:

#pragma paraguin forall C p i j k \
        -1 -1  1 -1 0x0 \
         1  1 -1  1 0x0

The next for loop nest will be partitioned to run on multiple processors. The data that follows the "forall" is a matrix of inequalities that determines which iterations are mapped to which partitions. p stands for the partition number; C stands for the constant term (the coefficient of 1). 0x0 is hex for zero (used to prevent SUIF from turning it into a string).

Parallel for (cont.):

// LU Decomposition
;
#pragma paraguin forall C p i j k \
        -1 -1  1 -1 0x0 \
         1  1 -1  1 0x0
for (i = 0; i <= N; i++)
    for (j = i + 1; j <= N; j++) {
        X[j][i] = X[j][i] / X[i][i];
        for (k = i + 1; k <= N; k++)
            X[j][k] = X[j][k] - X[j][i] * X[i][k];
    }

Parallel for (cont.): This matrix represents the affine inequalities

    [ -1 -1  1 -1  0 ]   [ 1 ]
    [  1  1 -1  1  0 ] x [ p ]  ≤  [ 0 ]
                         [ i ]     [ 0 ]
                         [ j ]
                         [ k ]

i.e., each row of the matrix, multiplied by the vector (1, p, i, j, k), must be ≤ 0.

Parallel for (cont.): Multiplying out the two rows gives

    -1 - p + i - j ≤ 0
     1 + p - i + j ≤ 0

i.e., p ≥ i - j - 1 and p ≤ i - j - 1; therefore p = i - j - 1.

Parallel for (cont.): With p = i - j - 1, the iterations map to partitions as follows:

p = 0: (i, j) = (1, 0), (2, 1), (3, 2), ...
p = 1: (i, j) = (2, 0), (3, 1), (4, 2), ...
p = 2: (i, j) = (3, 0), (4, 1), (5, 2), ...
p = 3: (i, j) = (4, 0), (5, 1), (6, 2), ...

Parallel for (cont.): (Figure: the (i, j) iteration space is partitioned along diagonals; each diagonal p = i - j - 1 is one partition.)

Matrix Multiplication Example:

;
#pragma paraguin forall C p i j k \
        0x0 -1  1 0x0 0x0 \
        0x0  1 -1 0x0 0x0
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}

Matrix Multiplication Example (cont.): Here the inequalities reduce to p = i, so all iterations with the same value of i belong to the same partition. (Figure: the (i, j) iteration space with one partition per value of i, p = 0 ... 4.)

#pragma paraguin forall C p i j k \
        0x0 -1  1 0x0 0x0 \
        0x0  1 -1 0x0 0x0

Mapping Partitions to Physical Processors: The generated code looks like

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &__guin_NP);
MPI_Comm_rank(MPI_COMM_WORLD, &__guin_mypid);
__guin_blksz = ceil((ub_p - lb_p + 1) / __guin_NP);
if (0 <= __guin_mypid & __guin_mypid <= __guin_NP - 1) {
    for (__guin_p = __guin_blksz * __guin_mypid;
         __guin_p <= min(N, __guin_blksz * (1 + __guin_mypid) - 1);
         __guin_p++)
        ...

where lb_p <= p <= ub_p.

Mapping Partitions to Physical Processors: Processor __guin_mypid (for __guin_mypid = 0, 1, ..., NP-1) is assigned the block of partitions

    p = __guin_blksz * __guin_mypid + 0
    p = __guin_blksz * __guin_mypid + 1
    p = __guin_blksz * __guin_mypid + 2
    ...
    p = __guin_blksz * (__guin_mypid + 1) - 1

(capped at the upper bound N). This is a block assignment of partitions to processors (as opposed to a cyclic assignment).

Mapping Partitions to Physical Processors: Substituting each processor's value of __guin_mypid (and abbreviating __guin_blksz as blksz):

    __guin_mypid = 0        __guin_mypid = 1        ...   __guin_mypid = NP-1
    p = blksz*0 + 0         p = blksz*1 + 0               p = blksz*(NP-1) + 0
    p = blksz*0 + 1         p = blksz*1 + 1               p = blksz*(NP-1) + 1
    p = blksz*0 + 2         p = blksz*1 + 2               ...
    ...                     ...                           ...
    p = blksz*(0+1) - 1     p = blksz*(1+1) - 1           ...
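
As a concrete illustration (the numbers here are chosen for the example, not taken from the slides): with NP = 4 processors and partitions p = 0 ... 15, blksz = ceil(16/4) = 4, so processor 0 executes partitions 0 through 3, processor 1 executes 4 through 7, processor 2 executes 8 through 11, and processor 3 executes 12 through 15.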

Broadcasting Data: Scatter is not implemented in Paraguin; one has to use broadcast to get the input to the other processors. This uses the broadcast operation of MPI, which is O(log2(NP)), not O(N).

    #pragma paraguin bcast X

generates

    MPI_Bcast(X, ..., MPI_COMM_WORLD);
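
A minimal sketch of how this is typically used (added for illustration; the array name, its initialization, and the exact placement of the input code are assumptions, not from the slides): the master prepares the input before the parallel region, and the broadcast makes it available to every process.

double X[N][N];    /* hypothetical input array; N as in the earlier examples */

/* Executed by the master only (outside the parallel region). */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        X[i][j] = 0.0;    /* stand-in for reading the real input */

;
#pragma paraguin begin_parallel
#pragma paraguin bcast X
/* ... the partitioned forall loop nest that reads X goes here ... */
;
#pragma paraguin end_parallel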

Loop-Carried Dependencies: Consider the code for the elimination step of Gaussian elimination:

for (i = 1; i <= N; i++)
    for (j = i+1; j <= N; j++)
        for (k = N+1; k >= i; k--)
            a[j][k] = a[j][k] - a[i][k] * a[j][i] / a[i][i];

There is a data dependence between the lhs of the assignment and the a[i][k] reference on the rhs: iteration (i_w, j_w, k_w) writes a value to a[j_w][k_w] that is read as a[i_r][k_r] in a later iteration with i_r = j_w = i_w + 1 and k_r = k_w. There is also a data dependence between the lhs and a[i][i] on the rhs, but we will only consider one dependence here.

Loop-Carried Dependencies (cont.):

...
i=0 j=1 k=3 : a[1][3] = ... - a[0][3] * ...
i=0 j=1 k=2 : a[1][2] = ... - a[0][2] * ...
i=0 j=1 k=1 : a[1][1] = ... - a[0][1] * ...
i=0 j=1 k=0 : a[1][0] = ... - a[0][0] * ...
...
i=1 j=2 k=3 : a[2][3] = ... - a[1][3] * ...
i=1 j=2 k=2 : a[2][2] = ... - a[1][2] * ...
i=1 j=2 k=1 : a[2][1] = ... - a[1][1] * ...
i=1 j=2 k=0 : a[2][0] = ... - a[1][0] * ...

The values written in the i=0, j=1 iterations are read in the i=1, j=2 iterations.

Loop-Carried Dependencies: Below is the pragma to specify the data dependence:

#pragma paraguin dep 0x0 2 C iw jw kw ir jr kr \
        0x0 0x0  1  0x0 -1  0x0 0x0 \
        0x0 0x0 -1  0x0  1  0x0 0x0 \
        0x0 0x0 0x0  1  0x0 0x0 -1  \
        0x0 0x0 0x0 -1  0x0 0x0  1  \
        -1  -1  0x0 0x0  1  0x0 0x0 \
         1   1  0x0 0x0 -1  0x0 0x0

Paraguin will insert the code for the processor writing the data to pack it up and send it to the processor that needs it. It also inserts the code for the processor that needs the data to receive the message and unpack it.

Gather: Gathering is getting the partial results back from the various processors to the master process.

    #pragma paraguin gather

Gather: Example: LU Decomposition

;
#pragma paraguin gather 3 C i j k \
        -1 -1  1 0x0 \
         1  1 -1 0x0
for (i = 0; i <= N; i++) {
    for (j = i + 1; j <= N; j++) {
        X[j][i] = X[j][i] / X[i][i];
        for (k = i + 1; k <= N; k++)
            X[j][k] = X[j][k] - X[j][i] * X[i][k];
    }
}

Gather:

#pragma paraguin gather 3 C i j k \
        -1 -1  1 0x0 \
         1  1 -1 0x0

The 3 indicates the 4th (starting at 0) array reference: X[j][k]. The system of inequalities indicates which values of the loop variables produce the final values of that array: j = i + 1, for all k.

Gather: The generated code:

for (__guin_p = 1 + __guin_blksz * __guin_mypid;
     __guin_p <= __suif_min(N, __guin_blksz + __guin_blksz * __guin_mypid);
     __guin_p++) {
    i = __guin_p - 1;
    j = i + 1;
    for (k = 1 * __guin_p; k <= 100; k++)
        MPI_Pack(&X[j][k], ..., MPI_COMM_WORLD);
}
MPI_Send(__guin_buffer, ..., 0, ..., MPI_COMM_WORLD);

Some Results

Gaussian Elimination:

#pragma paraguin forall C p i j k \
        0x0 -1 0x0  1 0x0 \
        0x0  1 0x0 -1 0x0
#pragma paraguin dep 0x0 2 C iw jw kw ir jr kr \
        0x0 0x0  1  0x0 -1  0x0 0x0 \
        0x0 0x0 -1  0x0  1  0x0 0x0 \
        0x0 0x0 0x0  1  0x0 0x0 -1  \
        0x0 0x0 0x0 -1  0x0 0x0  1  \
        -1  -1  0x0 0x0  1  0x0 0x0 \
         1   1  0x0 0x0 -1  0x0 0x0
#pragma paraguin dep 0x0 4 C iw jw kw ir jr kr \
        0x0 0x0 -1  0x0  1  0x0 0x0 \
        0x0 0x0  1  0x0 -1  0x0 0x0 \
        0x0 0x0 0x0  1  -1  0x0 0x0 \
        0x0 0x0 0x0 -1   1  0x0 0x0 \
        -1  -1  0x0 0x0  1  0x0 0x0 \
         1   1  0x0 0x0 -1  0x0 0x0

Gaussian Elimination (cont.):

#pragma paraguin gather 0x0 C i j k \
        -1 -1  1 0x0 \
         1  1 -1 0x0
#pragma paraguin gather 0x0 C i j k \
        0x0 -1 0x0  1 \
        0x0  1 0x0 -1
for (i = 1; i <= N; i++)
    for (j = i+1; j <= N; j++)
        for (k = N+1; k >= i; k--)
            a[j][k] = a[j][k] - a[i][k] * a[j][i] / a[i][i];

Gaussian Elimination (cont.): (performance results figure)

LU Decomposition:

#pragma paraguin forall C p i j k \
        -1 -1  1 -1 0x0 \
         1  1 -1  1 0x0
#pragma paraguin dep 3 2 C iw jw kw ir jr \
         1   1  0x0 0x0 -1  0x0 \
        -1  -1  0x0 0x0  1  0x0 \
        0x0 0x0  1  0x0 -1  0x0 \
        0x0 0x0 -1  0x0  1  0x0 \
        0x0 0x0 0x0  1  -1  0x0 \
        0x0 0x0 0x0 -1   1  0x0
#pragma paraguin dep 3 6 C iw jw kw ir jr kr \
        -1  -1  0x0 0x0  1  0x0 0x0 \
         1   1  0x0 0x0 -1  0x0 0x0 \
        0x0 0x0  1  0x0 -1  0x0 0x0 \
        0x0 0x0 -1  0x0  1  0x0 0x0 \
        0x0 0x0 0x0  1  0x0 0x0 -1  \
        0x0 0x0 0x0 -1  0x0 0x0  1

LU Decomposition (cont.):

#pragma paraguin gather 0 C i j \
        0x0 0x0 0x0
#pragma paraguin gather 3 C i j k \
        -1 -1  1 0x0 \
         1  1 -1 0x0
for (i = 0; i <= N; i++)
    for (j = i + 1; j <= N; j++) {
        X[j][i] = X[j][i] / X[i][i];
        for (k = i + 1; k <= N; k++)
            X[j][k] = X[j][k] - X[j][i] * X[i][k];
    }

LU Decomposition (cont.): (performance results figure)

Redundant Data in Messages: We discovered that the messages sent between processors for Gaussian elimination contained redundant data. Jerry Martin (MS student, 2010) studied detecting and eliminating this redundant data.

Redundant Data in Messages (cont.): A trace of the packing shows the same elements packed more than once before a send:

pack a[2][2] - Value: ...
pack a[2][3] - Value: ...
pack a[2][4] - Value: ...
pack a[2][5] - Value: ...
pack a[2][6] - Value: ...
pack a[2][2] - Value: ...
pack a[2][3] - Value: ...
pack a[2][4] - Value: ...
pack a[2][5] - Value: ...
pack a[2][6] - Value: ...
send to ...

Suppressing Redundant Data: (Figures: results with redundant data in messages vs. without redundant data in messages.)

Suppressing Redundant Data (cont.): (Additional figures: with redundant data in messages vs. without redundant data in messages.)

Communication Pattern of Gaussian Elimination: (Figure: rounds of messages among processors p0 ... p7, repeated throughout the computation.)

Loop-Carried Dependencies and Distributed-Memory Clusters: Notice that both Gaussian elimination and LU decomposition do no better than sequential execution, regardless of the number of processors. In fact, the performance gets worse as the number of processors increases. The issue is that we cannot expect to obtain speedup on a distributed-memory cluster when we have communication between processors; the communication is just too slow.

Communication Pattern that Works on a Distributed-Memory System: (Figure: p0 sends the input to p1 ... p7, each processor computes on its own, and the results come back to p0.) Pattern: beyond scattering the input and gathering the results, processors work independently.
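
A minimal hand-written MPI program that follows this pattern (a sketch added for illustration; it is not Paraguin-generated code, and read_input(), work(), and N are made up for the example):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 1024                        /* problem size, chosen for the example */

static double input[N];

/* Stand-ins for the real input and the real per-element computation. */
static void read_input(double *a)          { for (int i = 0; i < N; i++) a[i] = i; }
static double work(const double *a, int i) { return 2.0 * a[i]; }

int main(int argc, char *argv[])
{
    int rank, np;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    if (rank == 0)
        read_input(input);                 /* only the master reads the input */

    /* Everyone gets the whole input (Paraguin broadcasts rather than scatters). */
    MPI_Bcast(input, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each process computes its own block, with no further communication. */
    int blksz = (N + np - 1) / np;
    double *partial = calloc(blksz, sizeof(double));
    for (int i = 0; i < blksz && rank * blksz + i < N; i++)
        partial[i] = work(input, rank * blksz + i);

    /* The master gathers one block from every process. */
    double *result = NULL;
    if (rank == 0)
        result = malloc((size_t)np * blksz * sizeof(double));
    MPI_Gather(partial, blksz, MPI_DOUBLE,
               result, blksz, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("result[0] = %f, result[N-1] = %f\n", result[0], result[N - 1]);

    free(partial);
    free(result);
    MPI_Finalize();
    return 0;
}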

Matrix Multiplication:

;
#pragma paraguin begin_parallel
#pragma paraguin forall C p i j k \
        0x0 -1  1 0x0 0x0 \
        0x0  1 -1 0x0 0x0
#pragma paraguin bcast a b
#pragma paraguin gather 1 C i j k \
        0x0 0x0 0x0  1 \
        0x0 0x0 0x0 -1
// We need to gather all c[i][j]. However, array reference
// 1 is inside the k loop. If we put in an empty gather,
// then we'd have N copies of each c[i][j] sent to the
// master. To send just one, we use k = 0.
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}
;
#pragma paraguin end_parallel


Matrix Multiplication: (performance results figure)

Traveling Salesman Problem (TSP): The traveling salesman problem is to find the Hamiltonian cycle through a set of cities that minimizes the distance traveled. A brute-force search of the solution space requires us to consider all permutations of N cities; that is N! permutations. We can fix the first and last city to be city 0, since that removes cyclic variations of the same solution, e.g. 0->1->2->3->4->0 is the same as 0->4->3->2->1->0.

Traveling Salesman Problem (TSP): Creating permutations does not lend itself to easy parallelization, so we make a loop that iterates (n-1)(n-2) times and base the first two cities on the loop variable: city 0 is fixed, city 1 = p / (n-2) + 1, and city 2 = p % (n-2) + 1 (incremented by one if it collides with city 1).

N = n*n - 3*n + 2;   // (n-1)(n-2)
;
#pragma paraguin bcast n
#pragma paraguin bcast N
#pragma paraguin bcast D
#pragma paraguin forall C N pid p \
        0x0 0x... \
        0x0 0x...
for (p = 0; p < N; p++) {
    perm[1] = p / (n-2) + 1;
    perm[2] = p % (n-2) + 1;
    if (perm[2] >= perm[1])
        perm[2]++;
    initialize(perm, n, 3);
    do {
        dist = computeDist(D, n, perm);
        if (minDist < 0 || minDist > dist) {
            // ... Details omitted.
            // Record the minimum distance and permutation
        }
    } while (increment(perm, n));
}
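
A quick worked example of this mapping (the numbers are chosen here, not taken from the slides): with n = 6 cities, N = (n-1)(n-2) = 20 partitions. For p = 7: perm[1] = 7/4 + 1 = 2 and perm[2] = 7%4 + 1 = 4; since 4 >= 2, perm[2] becomes 5, so partition 7 enumerates all tours that begin 0 -> 2 -> 5.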


TSP: (performance results figure)

Hybrid: Hybrid parallel programs are ones that make use of the distributed-memory nodes of a cluster as well as the multiple cores within each node. We can use MPI to schedule processes to run on multiple nodes and then use OpenMP to schedule threads on the cores within a node. The threads on the cores of one node use a shared-memory model, whereas between nodes MPI uses a distributed-memory model.
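
A minimal hand-written hybrid sketch (added for illustration, not Paraguin output): one MPI process per node, several OpenMP threads inside each process.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int rank, np;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which MPI process am I?       */
    MPI_Comm_size(MPI_COMM_WORLD, &np);     /* how many MPI processes total? */

    #pragma omp parallel                    /* threads share this process's memory */
    printf("Process %d of %d, thread %d of %d\n",
           rank, np, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}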

Doing Hybrid in Paraguin: The Paraguin compiler is a source-to-source compiler: it creates C code with MPI calls from C code. This new code is compiled using the mpicc script, which uses gcc, and gcc also has OpenMP support. The Paraguin compiler simply passes through pragmas that it does not recognize, which is what makes a hybrid program possible.
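
For a plain (non-Paraguin) hybrid program, the build and launch would look roughly like this; the file name and the exact flags are assumptions that depend on the local MPI installation:

$ mpicc -fopenmp hybrid.c -o hybrid
$ export OMP_NUM_THREADS=4          # threads per MPI process
$ mpirun -np 8 ./hybrid             # 8 MPI processes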

Matrix Multiplication (Hybrid):

...
#pragma paraguin forall C p i j k \
        0x0 -1  1 0x0 0x0 \
        0x0  1 -1 0x0 0x0
...
#pragma omp parallel for private(i,j,k) schedule(static) num_threads(NUM_THREADS)
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }
}
...

Matrix Multiplication (Hybrid): (performance results figure)

Questions?