Paraguin Compiler Version 2.1

Introduction The Paraguin compiler is a compiler that I am developing at UNCW (by myself, basically). It is based on the SUIF compiler infrastructure. Using pragmas (compiler directives), the user can direct the compiler to produce an MPI program. The user manual can be accessed at: http://people.uncw.edu/cferner/Paraguin/userman.pdf

SUIF Compiler System Created by the SUIF Compiler Group at Stanford (suif.stanford.edu). SUIF is an open-source compiler intended to promote research in compiler technology. Paraguin is built using the SUIF compiler.

Compiler Directives The Paraguin compiler is a source-to-source compiler. It transforms a sequential program into a parallel program suitable for execution on a distributed-memory system. The result is a parallel program with calls to MPI routines. Parallelization is not automatic, but rather directed via pragmas.

Compiler Directives The advantage of using pragmas is that other compilers will ignore them. You can provide information to Paraguin that is ignored by other compilers, such as gcc. You can create a hybrid program using pragmas for different compilers, as sketched below. Syntax: #pragma paraguin <type> [<parameters>]
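
A minimal sketch of the hybrid idea (my illustration, not taken from the user manual; it assumes OpenMP pragmas pass through Paraguin to the backend compiler): each compiler acts only on the directives it recognizes.

#pragma paraguin begin_parallel
/* Every MPI process executes this region (begin/end_parallel is
   explained on the following slides); within each process, OpenMP
   splits the loop among threads. a, i, and n are placeholders. */
#pragma omp parallel for
for (i = 0; i < n; i++)
    a[i] = 2 * a[i];
#pragma paraguin end_parallel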

Running a Parallel Program When your parallel program is run, you specify how many processors you want on the command line (or in a job submission file). Processes (one per processor) are each given a unique rank in the range [0 .. NP-1], where NP is the number of processors. Process 0 is considered to be the master.

Parallel Region

...
#pragma paraguin begin_parallel
...
#pragma paraguin end_parallel
...

Code inside the parallel region is executed by all processors. Code outside the parallel region is executed by the master process (rank 0) only; all other processors do not execute it.

Hello World

#ifdef PARAGUIN
typedef void* __builtin_va_list;
#endif
#include <stdio.h>
#include <unistd.h>   /* for gethostname() */

int __guin_rank = 0;

int main(int argc, char *argv[])
{
    char hostname[256];
    printf("Master process %d starting.\n", __guin_rank);
    #pragma paraguin begin_parallel
    gethostname(hostname, 255);
    printf("Hello world from process %3d on machine %s.\n", __guin_rank, hostname);
    #pragma paraguin end_parallel
    printf("Goodbye world from process %d.\n", __guin_rank);
    return 0;
}

Explanation of Hello World

#ifdef PARAGUIN
typedef void* __builtin_va_list;
#endif

This is here to deal with an incompatibility issue between the SUIF compiler and gcc. Don't worry about it; just put it into your program.

Explanation of Hello World

#include <stdio.h>
int __guin_rank = 0;

int main(int argc, char *argv[])
{
    printf("Master process %d starting.\n", __guin_rank);
    ...

__guin_rank is a predefined Paraguin identifier. We are allowed to declare it and even initialize it, but it should not be modified. The reason for declaring it ourselves is so that we can compile this program with gcc (with no modification to the source code) to create a sequential version of the program.
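
Since the Paraguin pragmas are ignored by other compilers, the same file can presumably be compiled sequentially with something like this (my guess at the command; the deck itself only shows the scc line later):

$ gcc helloWorld.c -o helloWorld_seq

With __guin_rank fixed at 0, the sequential program behaves like the master process alone.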

Explanation of Hello World

#pragma paraguin begin_parallel
gethostname(hostname, 255);
printf("Hello world from process %3d on machine %s.\n", __guin_rank, hostname);
#pragma paraguin end_parallel

This defines a region to be executed by all processors. Outside of this region, only the master process executes the statements.

Explanation of Hello World Only the master process (rank 0) executes the code outside a parallel region; the other processors skip it.

printf("Master process %d starting.\n", __guin_rank);
#pragma paraguin begin_parallel

[Diagram: PE 0 executes the printf; PEs 1 through 5 skip it]

Explanation of Hello World Only the master process (rank 0) executes the code outside a parallel region; the other processors skip it.

#pragma paraguin end_parallel
printf("Goodbye world from process %d.\n", __guin_rank);
}

[Diagram: PE 0 executes the printf; PEs 1 through 5 skip it]

Result of Hello World

Compiling (all on one line):

$ scc -DPARAGUIN -D__x86_64__ -I/opt/openmpi/include/ -cc mpicc helloWorld.c -o helloWorld

Running:

$ mpirun -np 8 helloWorld
Master process 0 starting.
Hello world from process 0 on machine compute-1-5.local.
Goodbye world from process 0.
Hello world from process 1 on machine compute-1-5.local.
Hello world from process 2 on machine compute-1-5.local.
Hello world from process 3 on machine compute-1-5.local.
Hello world from process 4 on machine compute-1-1.local.
Hello world from process 5 on machine compute-1-1.local.
Hello world from process 6 on machine compute-1-1.local.
Hello world from process 7 on machine compute-1-1.local.

Notes on pragmas Many times you need an extra semicolon (;) in front of the pragma statements. The reason is to insert a no-op statement into the code, to which the pragmas can be attached. SUIF attaches pragmas to the last instruction, which may be deeply nested; this makes it difficult for Paraguin to find them. Solution: insert a semicolon on a line by itself before a block of pragma statements.

Incorrect Location of Pragma

This code:

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        a[i][j] = 0;
    }
#pragma paraguin begin_parallel

actually appears to the compiler like this:

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        a[i][j] = 0;
        #pragma paraguin begin_parallel
    }

The pragma is attached to the last instruction, deep inside the loop nest.

Solution

Solution: put a semicolon on a line by itself in front of the pragma:

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        a[i][j] = 0;
    }
;
#pragma paraguin begin_parallel

Usually, it is needed after a nesting (e.g. a for loop nest, a while loop nest, etc.).

More on Parallel Regions

The parallel region pragmas must be at the topmost nesting within a function:

int f () {
    #pragma paraguin begin_parallel
    ...
    #pragma paraguin end_parallel
}

Placing them inside a nested block is an error:

int g() {
    ...
    if (a < b) {
        #pragma paraguin begin_parallel   /* This is an error */
        ...
    }
}

Parallel Regions Related to Functions

If a function is to be executed in parallel, it must have its own parallel region, and the call to it must also be in a parallel region:

int f () {
    #pragma paraguin begin_parallel
    ...
    #pragma paraguin end_parallel
}

int main() {
    ...
    #pragma paraguin begin_parallel
    f();   /* this call will execute in parallel */
    #pragma paraguin end_parallel
    ...
    f();   /* this call will execute sequentially, regardless of f's own parallel region */
}

Initializations

Initializations of variables are executable statements (as opposed to the declarations themselves); therefore, they need to be within a parallel region.

int f () {
    int a = 23, b;
    #pragma paraguin begin_parallel
    b = 46;
    ...
    #pragma paraguin end_parallel
}

a will be initialized on the master only, because its initialization is outside a parallel region. b will be initialized on all processors.

Parallel Constructs All of these must be within a parallel region (some would deadlock if not):

#pragma paraguin barrier
#pragma paraguin forall
#pragma paraguin bcast
#pragma paraguin scatter
#pragma paraguin gather
#pragma paraguin reduce

Barrier A barrier is a point at which all processors stop until they all arrive at the same point, after which they may proceed. It's like a rendezvous.

[Diagram: PEs 0 through 5 wait at the barrier until the last one arrives]

Barrier … #pragma paraguin barrier
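
A sketch of a typical use (hypothetical code, consistent with the description above): no process should start the second phase of a computation until every process has finished the first.

#pragma paraguin begin_parallel
do_phase_one();           /* hypothetical function: each process works on its part */
;
#pragma paraguin barrier  /* all processes wait here until the last arrives */
do_phase_two();           /* safe: phase one is complete on every process */
#pragma paraguin end_parallel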

Parallel For (or forall) To execute a for loop in parallel:

#pragma paraguin forall [chunksize]

Each processor will execute a different partition of the iterations (called the iteration space). The partitions will be no larger than chunksize iterations. The default chunksize is ⌈n / NP⌉ (the ceiling of n divided by NP), where n is the number of iterations and NP is the number of processors.

Parallel For (or forall) For example, consider:

#pragma paraguin forall
for (i = 0; i < n; i++) {
    <body>
}

Suppose n = 13. The iteration space is i = 0, 1, 2, ..., 12 (13 iterations).

Parallel For (or forall) Also suppose we have 4 processors. The default chunksize is ⌈13/4⌉ = 4. The iteration space will be executed by the 4 processors as:

PE 0: i=0, i=1, i=2, i=3
PE 1: i=4, i=5, i=6, i=7
PE 2: i=8, i=9, i=10, i=11
PE 3: i=12

Parallel For (other notes) Note that a for loop executed as a forall must be a simple for loop: the increment must be positive 1 (and the upper bound must be greater than the lower bound), and the loop termination must use either < or <=.

A nested for loop can contain a forall:

for (i = 0; i < n; i++) {
    #pragma paraguin forall
    for (j = 0; j < n; j++) {
        ...

However, foralls cannot be nested.

How to transform for loops to simple for loops

Count-down loop:

for (i = n-1; i >= 0; i--) {
    ...

becomes:

#pragma paraguin forall
for (tmp = 0; tmp < n; tmp++) {
    i = n - tmp - 1;
    ...

Nested loops:

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        ...

become:

for (tmp = 0; tmp < n*n; tmp++) {
    i = tmp / n;
    j = tmp % n;
    ...
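
Putting the transformation together with a forall (a sketch using the rules above; the loop body is the array initialization from the earlier slides):

;
#pragma paraguin forall
for (tmp = 0; tmp < n*n; tmp++) {
    i = tmp / n;          /* row index recovered from the flattened index */
    j = tmp % n;          /* column index */
    a[i][j] = 0;
}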

Parallel For (other notes) If the user provides a chunksize, then each processor cycles through chunksize iterations in a cyclic fashion. Specifying a chunksize of 1 gives cyclic scheduling (better load balancing):

PE 0: i=0, i=4, i=8, i=12
PE 1: i=1, i=5, i=9
PE 2: i=2, i=6, i=10
PE 3: i=3, i=7, i=11

Broadcast Broadcasting sends the same data from the master to all processors:

#pragma paraguin bcast <list of variables>

A broadcast is likely to be faster than individual messages.

int a, b[N][M], n;
char *s = "hello world";
n = strlen(s) + 1;
#pragma paraguin begin_parallel
#pragma paraguin bcast a b n s( n )

Broadcast

int a, b[N][M], n;
char *s = "hello world";
n = strlen(s) + 1;
#pragma paraguin begin_parallel
#pragma paraguin bcast a b n s( n )

Variable a is a scalar and b is an array, but in both cases the correct number of bytes is broadcast: N*M*sizeof(int) bytes are broadcast for variable b.

Broadcast

int a, b[N][M], n;
char *s = "hello world";
n = strlen(s) + 1;
#pragma paraguin begin_parallel
#pragma paraguin bcast a b n s( n )

Variable s is a string, or rather a pointer; there is no way to know how big its data actually is. Pointers require a size (such as s( n )). If the size is not given, then only one character will be broadcast.

Broadcast

int a, b[N][M], n;
char *s = "hello world";
n = strlen(s) + 1;
#pragma paraguin begin_parallel
#pragma paraguin bcast a b n s( n )

Notice that s and n are initialized on the master only. 1 is added to strlen(s) to include the null character. Variable n must be broadcast BEFORE variable s. Put spaces between the parentheses and the size (e.g. ( n )).
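
Pulling these rules together, a minimal self-contained broadcast sketch (my example, not from the deck; it mirrors the fragment above):

#include <stdio.h>
#include <string.h>

int __guin_rank = 0;

int main(int argc, char *argv[])
{
    int n;
    char *s = "hello world";
    n = strlen(s) + 1;                /* master only: outside the parallel region */
    #pragma paraguin begin_parallel
    ;
    #pragma paraguin bcast n s( n )   /* n first, so the size of s is known */
    printf("Process %d received: %s\n", __guin_rank, s);
    #pragma paraguin end_parallel
    return 0;
}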

Scatter Scattering divides up data that resides on the master among the other processors:

#pragma paraguin scatter <list of variables>

Scatter

void f(int *A, int n)
{
    int B[N];
    ... // Initialize B somehow
    #pragma paraguin begin_parallel
    #pragma paraguin scatter A( n ) B
    ...

The same rule applies for pointers with scatter as with broadcast: the size must be given. Only arrays should be scattered (it makes no sense to scatter a scalar).

Scatter The default chunksize is ⌈N / NP⌉ rows, where N is the number of rows and NP is the number of processors. Notice that rows are scattered, not columns. A user-defined chunksize is not yet implemented.

Gather Gather works just like scatter, except that the data moves in the opposite direction:

#pragma paraguin gather <list of variables>

Gather Gather is the collection of partial results back to the master. The default chunksize is ⌈N / NP⌉ rows, where N is the number of rows and NP is the number of processors. A user-defined chunksize is not yet implemented. A typical scatter/compute/gather pattern is sketched below.
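
As mentioned above, a sketch of the full scatter/compute/gather pattern (my example, assuming the default row-wise chunking; matrix-vector multiplication is the placeholder computation):

double A[N][N], x[N], y[N];
int i, j;
/* master initializes A and x here, outside the parallel region */
#pragma paraguin begin_parallel
;
#pragma paraguin scatter A        /* each process receives a block of rows of A */
#pragma paraguin bcast x          /* every process needs all of x */
#pragma paraguin forall
for (i = 0; i < N; i++) {         /* each process computes its rows of y */
    y[i] = 0.0;
    for (j = 0; j < N; j++)
        y[i] += A[i][j] * x[j];
}
;
#pragma paraguin gather y         /* partial results are collected on the master */
#pragma paraguin end_parallel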

Reduction A reduction is when a binary commutative operator is applied to a collection of values, producing a single value:

#pragma paraguin reduce <op> <source> <result>

where
<op> is the operator,
<source> is the variable with the data to be reduced, and
<result> is the variable that will hold the answer.

Reduction For example, applying summation to the values 83, 40, 23, 85, 90, 2, 74, 68, 51, 33 produces the single value 549. MPI does not specify how reduction should be implemented; however, ...

Reduction ... a reduction could be implemented fairly efficiently on multiple processors using a tree, in which case the time is O(log(NP)).

Reduction Available operators that can be used in a reduction:

max     Maximum
min     Minimum
sum     Summation
prod    Product
land    Logical and
band    Bitwise and
lor     Logical or
bor     Bitwise or
lxor    Logical exclusive or
bxor    Bitwise exclusive or
maxloc  Maximum and location
minloc  Minimum and location

Reduction

double c, result_c;
...
#pragma paraguin begin_parallel
// Each processor assigns some value to the variable c
#pragma paraguin reduce sum c result_c
// The variable result_c on the master now holds the result
// of summing the values of the variable c on all the
// processors

Reducing an Array When a reduction is applied to an array, the values in the same relative position in the array are reduced across processors:

double c[N], result_c[N];
...
#pragma paraguin begin_parallel
// Each processor assigns N values to the array c
#pragma paraguin reduce sum c result_c
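
For instance (a hypothetical use of the same pragma), if each process builds a local histogram over its share of the data, an element-wise sum reduction yields the global histogram on the master:

int hist[BINS], global_hist[BINS];   /* BINS is a placeholder constant */
int k;
#pragma paraguin begin_parallel
for (k = 0; k < BINS; k++)
    hist[k] = 0;                     /* zero on every process (see Initializations) */
/* ... each process increments hist[] from its portion of the data ... */
;
#pragma paraguin reduce sum hist global_hist
/* on the master, global_hist[k] now holds the sum of hist[k] over all processes */
#pragma paraguin end_parallel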

Reducing an Array

[Diagram: element-wise summation of each c[i] across processors into result_c[i] on the master]

Next Topic Patterns: Scatter/Gather, Stencil

Questions?