Using a compiler-directed approach to create MPI code automatically


Using a compiler-directed approach to create MPI code automatically: Paraguin Compiler Continued. ITCS4145/5145, Parallel Programming. Clayton Ferner / B. Wilkinson. March 11, 2014. ParagionSlides2.ppt

The Paraguin compiler is being developed by Dr. C. Ferner, UNC-Wilmington. The following is based upon his slides (and assumes you have already seen OpenMP).

Running a Parallel Program

When your parallel program is run, you specify on the command line (or in a job submission file) how many processors you want. Processes (one per processor) are each given a unique rank in the range [0 .. NP-1], where NP is the number of processors. Process 0 is considered to be the master.
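For example (a sketch; the executable name and process count are placeholders, mirroring the mpiexec command shown later for Hello World):

    $ mpiexec -n 4 ./myprogram      # 4 processes, ranks 0..3; rank 0 is the master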

Parallel Region

// Sequential Code
...
#pragma paraguin begin_parallel
// Code to be executed by all processors
#pragma paraguin end_parallel

Code outside a parallel region is executed by the master process (rank 0) only; all other processors skip it. Code inside a parallel region is executed by all processors.

Hello World

The #ifdef PARAGUIN block deals with an incompatibility issue between the SUIF compiler and gcc. Don't worry about it; just put it into your program.

#ifdef PARAGUIN
typedef void* __builtin_va_list;
#endif

#include <stdio.h>

int __guin_rank = 0;

int main(int argc, char *argv[])
{
  char hostname[256];

  printf("Master process %d starting.\n", __guin_rank);

  #pragma paraguin begin_parallel
  gethostname(hostname, 255);
  printf("Hello world from process %3d on machine %s.\n", __guin_rank, hostname);
  #pragma paraguin end_parallel

  printf("Goodbye world from process %d.\n", __guin_rank);
}

__guin_rank is a predefined Paraguin identifier that represents the ID (rank) of each MPI process. We are allowed to declare it and even initialize it, but it should not be modified.

Explanation of Hello World

(Figure: the statements between begin_parallel and end_parallel execute on all processing elements PE 0 .. PE 5; the first printf executes only on PE 0.)

The begin_parallel/end_parallel pair defines a region to be executed by all processors. Outside of this region, only the master process (rank 0) executes the statements; the other processors skip them. THERE IS NO IMPLIED BARRIER AT THE BEGINNING OR END OF A PARALLEL REGION.

Result of Hello World

Compiling (all on one line):

$ scc -DPARAGUIN -D__x86_64__ -I/opt/openmpi/include/ -cc mpicc helloWorld.c -o helloWorld

Running:

$ mpiexec -n 8 helloWorld
Master process 0 starting.
Hello world from process 0 on machine compute-1-5.local.
Goodbye world from process 0.
Hello world from process 1 on machine compute-1-5.local.
Hello world from process 2 on machine compute-1-5.local.
Hello world from process 3 on machine compute-1-5.local.
Hello world from process 4 on machine compute-1-1.local.
Hello world from process 5 on machine compute-1-1.local.
Hello world from process 6 on machine compute-1-1.local.
Hello world from process 7 on machine compute-1-1.local.

Notes on the location of pragmas

SUIF attaches each pragma to the last instruction before it, which may be deeply nested. This can make it difficult for Paraguin to find the pragmas.

Incorrect Location of a Pragma

This code:

for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    a[i][j] = 0;
#pragma paraguin begin_parallel

actually appears like this:

    a[i][j] = 0;
    #pragma paraguin begin_parallel
  }

Solution

Insert a semicolon on a line by itself before a block of pragma statements. This inserts a NOOP instruction into the code, and the pragmas are attached to that NOOP instead:

for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    a[i][j] = 0;
;
#pragma paraguin begin_parallel

This is usually needed after a nesting (e.g. a for loop nest, while loop nest, etc.).

More on Parallel Regions

The parallel region pragmas must be at the topmost nesting level within a function:

int f ()
{
  #pragma paraguin begin_parallel
  ...
  #pragma paraguin end_parallel
}

More on Parallel Regions

All processes must reach end_parallel. This is an error:

int g()
{
  #pragma paraguin begin_parallel
  ...
  if (a < b)
    #pragma paraguin end_parallel
}
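A corrected sketch of the same idea (variable names are illustrative, not from the slides): keep end_parallel at the topmost nesting level so that every process reaches it, and put the conditional work inside the region.

int g(int a, int b)
{
  int x = 0;
  #pragma paraguin begin_parallel
  if (a < b) {
    x = a;                        /* conditional work done by every process */
  }
  ;                               /* semicolon so the pragma is not attached inside the if (see earlier slide) */
  #pragma paraguin end_parallel   /* at the top nesting level, reached by all processes */
  return x;
}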

Parallel regions and sequential regions are implemented in Paraguin by using a simple if statement to check whether the rank of the process is zero. The user may create a sequential region within a parallel region simply by surrounding that code with an if statement such as:

int __guin_rank = 0;
...
#pragma paraguin begin_parallel
if (__guin_rank == 0) {
  // Sequential region within a parallel region
}
#pragma paraguin end_parallel

Parallel Regions Related to Functions

If a function is to be executed totally in parallel, it does not need its own parallel region:

int f ()
{
  ...
}

int main()
{
  #pragma paraguin begin_parallel
  f();
  #pragma paraguin end_parallel
}

A call to f() inside the parallel region executes in parallel; a call outside the region executes sequentially (on the master only). This has been relaxed from what was said in the user manual. If a function contains one or more Paraguin directives, then the calling routine also needs a parallel region.
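To illustrate the last point, here is a hedged sketch (the function name, array, and size are made up, and it assumes, as the rule above states, that a directive may appear in the callee while the enclosing parallel region is supplied by the caller):

#define N 100

/* hypothetical helper containing a Paraguin directive */
void zero_matrix(double a[N][N])
{
  int i, j;
  #pragma paraguin forall
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      a[i][j] = 0.0;
}

int main(void)
{
  double a[N][N];
  #pragma paraguin begin_parallel
  zero_matrix(a);     /* the caller provides the parallel region */
  #pragma paraguin end_parallel
  return 0;
}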

Initializations

Initializations of variables are executable statements (as opposed to the declarations themselves). Therefore, they need to be within a parallel region to take effect on all processes:

int f ()
{
  int a = 23, b;      // a is initialized on the master only, because it is outside a parallel region
  #pragma paraguin begin_parallel
  b = 46;             // b is assigned on all processors
  ...
  #pragma paraguin end_parallel
}

Parallel Constructs

#pragma paraguin barrier
#pragma paraguin forall
#pragma paraguin bcast
#pragma paraguin scatter
#pragma paraguin gather
#pragma paraguin reduce

All of these must be within a parallel region (some would deadlock if not). Their placement is similar to that of worksharing constructs inside parallel regions in OpenMP.

Barrier

A barrier is a point at which all processors stop until they have all arrived, after which they may proceed. The directive performs a barrier on MPI_COMM_WORLD (all processes):

...
#pragma paraguin barrier

(Figure: PE 0 .. PE 5 all wait at the barrier before continuing.)
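A minimal usage sketch (do_local_work and the message are illustrative; __guin_rank is declared as in the Hello World example): the barrier ensures every process has finished its local work before the master prints.

#pragma paraguin begin_parallel
do_local_work(__guin_rank);          /* hypothetical per-process work */
#pragma paraguin barrier             /* wait here until all processes arrive */
if (__guin_rank == 0)
  printf("All processes reached the barrier.\n");
#pragma paraguin end_parallel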

Parallel For (or forall)

To execute a for loop in parallel:

#pragma paraguin forall [chunksize]

Each process will execute a different partition of the iterations (called the iteration space). Partitions will be no larger than chunksize iterations. The default chunksize is ⌈n/P⌉, where n is the number of iterations and P is the number of processes.

Parallel For (or forall)

For example, consider:

#pragma paraguin forall
for (i = 0; i < n; i++) {
  <body>
}

Suppose n = 13. The iteration space is i = 0, 1, 2, ..., 12.

Parallel For (or forall)

Also suppose we have 4 processes. The default chunksize is ⌈13/4⌉ = 4, so the iteration space is executed by the 4 processes as:

P0: i = 0, 1, 2, 3
P1: i = 4, 5, 6, 7
P2: i = 8, 9, 10, 11
P3: i = 12

Parallel For (other notes)

Note that the for loop executed as a forall must be a simple for loop: the increment must be positive 1 (and the upper bound must be greater than the lower bound), and the loop termination must use either < or <=.

A nested for loop can be a forall (see the sketch below):

for (i = 0; i < n; i++) {
  #pragma paraguin forall
  for (j = 0; j < n; j++) {
    ...

However, foralls cannot be nested.
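A complete hedged sketch of that nested case (the array, its size, and the loop body are illustrative): the outer i loop runs on every process, while the inner j loop's iteration space is partitioned by the forall.

#define N 16

int main(void)
{
  double a[N][N];
  int i, j;

  #pragma paraguin begin_parallel
  for (i = 0; i < N; i++) {
    #pragma paraguin forall
    for (j = 0; j < N; j++)
      a[i][j] = (double)(i + j);   /* each process fills its partition of row i */
  }
  ;                                /* semicolon so the following pragma is not attached inside the nest */
  #pragma paraguin end_parallel
  return 0;
}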

How to transform for loops into simple for loops

Count-down loop:

for (i = n-1; i >= 0; i--) {
  ...

becomes

#pragma paraguin forall
for (tmp = 0; tmp < n; tmp++) {
  i = n - tmp - 1;
  ...

Nested loops:

for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    ...

become

for (tmp = 0; tmp < n*n; tmp++) {
  i = tmp / n;
  j = tmp % n;
  ...

Parallel For (other notes)

If the user provides a chunksize, then each process cycles through chunksize iterations at a time in a cyclic fashion. Specifying a chunksize of 1 gives cyclic scheduling (better load balancing). With n = 13, 4 processes, and chunksize 1:

P0: i = 0, 4, 8, 12
P1: i = 1, 5, 9
P2: i = 2, 6, 10
P3: i = 3, 7, 11
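A hedged sketch of requesting cyclic scheduling (the work function and result array are illustrative): the chunksize is given after the forall keyword.

#pragma paraguin begin_parallel
#pragma paraguin forall 1               /* chunksize 1: iterations handed out cyclically */
for (i = 0; i < n; i++) {
  result[i] = expensive_work(i);        /* hypothetical uneven workload; cyclic scheduling balances it */
}
;
#pragma paraguin end_parallel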

Broadcast

Broadcasting sends the same data from the master to all processes:

#pragma paraguin bcast <list of variables>

<list of variables> is a whitespace-separated list. Variables may be arrays or scalars of type byte, char, unsigned char, short, int, long, float, double, or long double. A separate broadcast is performed for each variable in the list. If a variable is a pointer type, then only one element of the base type (the one the pointer points to) will be broadcast. If the pointer points to an array or a portion of an array, the user must provide the number of elements to send (see next slide).

Broadcast example

int a, b[N][M], n;
char *s = "hello world";
n = strlen(s) + 1;
#pragma paraguin begin_parallel
#pragma paraguin bcast a b n s( n )
...
#pragma paraguin end_parallel

Variable a is a scalar and b is an array; in both cases the correct number of bytes is broadcast (N*M*sizeof(int) bytes for b). Variable s is a string (a pointer), so there is no way to know how big the data actually is. Pointers require a size, such as s( n ); if the size is not given, only one character will be broadcast.

Broadcast

int a, b[N][M], n;
char *s = "hello world";
n = strlen(s) + 1;
#pragma paraguin begin_parallel
#pragma paraguin bcast a b n s( n )

Notice that s and n are initialized on the master only. 1 is added to strlen(s) to include the null character. Variable n must be broadcast BEFORE variable s. Put spaces between the parentheses and the size (e.g. ( n )).

Scatter

Scatter divides data that resides on the master and distributes the pieces among the other processors:

#pragma paraguin scatter <list of variables>

(Figure: an array on the master split into chunks, one chunk sent to each process; here chunksize = 2.)

Scatter

void f(int *A, int n)
{
  int B[N];
  ...  // Initialize B somehow
  #pragma paraguin begin_parallel
  #pragma paraguin scatter A( n ) B
  ...
  #pragma paraguin end_parallel
}

The same rule applies for pointers with scatter as with broadcast: the size must be given. Only arrays should be scattered (it makes no sense to scatter a scalar).

Scatter

The default chunksize is ⌈N/P⌉, where N is the number of rows and P is the number of processes. Notice that rows are scattered, not columns. A user-defined chunksize is not yet implemented.

Gather

Gather works just like scatter except that the data moves in the opposite direction:

#pragma paraguin gather <list of variables>

Gather

Gather is the collection of partial results back onto the master. The default chunksize is ⌈N/P⌉, where N is the number of rows and P is the number of processes. A user-defined chunksize is not yet implemented.
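Putting scatter, forall, and gather together, here is a hedged sketch of the usual pattern (array names, sizes, and the computation are illustrative, and it assumes the default row-wise partitioning of scatter, forall, and gather line up, as in the slides' default chunksize):

#define N 8
#define M 8

int main(void)
{
  double A[N][M], B[N][M];
  int i, j;

  // Master initializes the input (outside the parallel region)
  for (i = 0; i < N; i++)
    for (j = 0; j < M; j++)
      A[i][j] = i * M + j;
  ;
  #pragma paraguin begin_parallel
  #pragma paraguin scatter A        /* rows of A are divided among the processes */
  #pragma paraguin forall
  for (i = 0; i < N; i++)
    for (j = 0; j < M; j++)
      B[i][j] = 2.0 * A[i][j];      /* each process works on its own rows */
  ;
  #pragma paraguin gather B         /* partial results collected back on the master */
  #pragma paraguin end_parallel
  return 0;
}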

Reduction

A reduction applies a binary commutative operator to a collection of values, producing a single value:

#pragma paraguin reduce <op> <source> <result>

where <op> is the operator, <source> is the variable holding the data to be reduced, and <result> is the variable that will hold the answer.

Reduction

For example, applying summation to the values 83, 40, 23, 85, 90, 2, 74, 68, 51, 33 produces the single value 83 + 40 + 23 + 85 + 90 + 2 + 74 + 68 + 51 + 33 = 549. MPI does not specify how a reduction should be implemented; however, ...

Reduction

A reduction can be implemented fairly efficiently on multiple processors using a tree, in which case the time complexity is O(log P) with P processes. (With P = 8, for example, a tree needs only log2(8) = 3 combining steps instead of 7 sequential ones.)

Reduction

Available operators that can be used in a reduction (MPI):

Operator  Description
max       Maximum
min       Minimum
sum       Summation
prod      Product
land      Logical and
band      Bitwise and
lor       Logical or
bor       Bitwise or
lxor      Logical exclusive or
bxor      Bitwise exclusive or
maxloc    Maximum and location
minloc    Minimum and location

Reduction

double c, result_c;
...
#pragma paraguin begin_parallel
// Each processor assigns some value to the variable c
...
#pragma paraguin reduce sum c result_c
// The variable result_c on the master now holds the result
// of summing the values of the variable c on all the processors

Reducing an Array

When a reduction is applied to an array, the values in the same relative position in the array are reduced element-wise across processors:

double c[N], result_c[N];
...
#pragma paraguin begin_parallel
// Each process assigns N values to array c
...
#pragma paraguin reduce sum c result_c

Questions?

More detailed information on Paraguin is available at http://people.uncw.edu/cferner/Paraguin/userman.pdf

Next Topic

Higher-level patterns in Paraguin: the Scatter/Gather template and Stencil.