Hybrid Parallel Programming with the Paraguin compiler
The Paraguin compiler can also create hybrid (MPI + OpenMP) programs. Because it uses mpicc as its back-end compiler, it passes OpenMP pragmas through to the resulting source, which can then be compiled with OpenMP support.
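For context, a minimal sketch of a hand-written hybrid MPI + OpenMP program of the kind this produces (the file name and the thread count of 4 are illustrative assumptions, not Paraguin output):

/* hybrid_hello.c -- minimal hybrid sketch: one MPI process per computer,
   several OpenMP threads within each process.
   Build (sketch): mpicc -fopenmp hybrid_hello.c -o hybrid_hello */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The OpenMP pragma creates threads inside each MPI process. */
    #pragma omp parallel num_threads(4)
    printf("<pid %d>: tid = %d\n", rank, omp_get_thread_num());

    MPI_Finalize();
    return 0;
}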

Compiling

First we need to compile to source code:

scc -DPARAGUIN -D__x86_64__ matrixmult.c -.out.c

Then we can compile with MPI and OpenMP:

mpicc -fopenmp matrixmult.out.c -o matrixmult.out
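To run the result (a sketch; the launcher name, process count, and hostfile conventions depend on the MPI installation):

mpiexec -n 8 ./matrixmult.out

Each of the 8 MPI processes then creates its own OpenMP threads wherever a num_threads clause appears.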

Hybrid Matrix Multiplication using Paraguin

#pragma paraguin begin_parallel
#pragma paraguin scatter a
#pragma paraguin bcast b
#pragma paraguin forall
for (i = 0; i < N; i++) {
#pragma omp parallel for private(tID, j, k) num_threads(4)
    for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}

The i loop will be partitioned among the computers. The j loop will be partitioned among the 4 cores within each computer.
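As a concrete illustration (the numbers are assumed, and Paraguin's exact partitioning strategy may differ): with N = 512 and 8 MPI processes, each process would receive roughly 512/8 = 64 iterations of the i loop, and within each process the 4 OpenMP threads would each handle roughly 512/4 = 128 iterations of the j loop.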

Debug Statements

<pid 0, thread 1>: c[0][1] += a[0][0] * b[1][0]
<pid 0, thread 1>: c[0][1] += a[0][1] * b[1][1]
<pid 0, thread 1>: c[0][1] += a[0][2] * b[1][2]
<pid 0, thread 2>: c[0][2] += a[0][0] * b[2][0]
<pid 0, thread 2>: c[0][2] += a[0][1] * b[2][1]
<pid 0, thread 2>: c[0][2] += a[0][2] * b[2][2]
<pid 1, thread 1>: c[1][1] += a[1][0] * b[1][0]
<pid 1, thread 1>: c[1][1] += a[1][1] * b[1][1]
<pid 1, thread 1>: c[1][1] += a[1][2] * b[1][2]
<pid 2, thread 1>: c[2][1] += a[2][0] * b[1][0]
<pid 2, thread 1>: c[2][1] += a[2][1] * b[1][1]
<pid 2, thread 1>: c[2][1] += a[2][2] * b[1][2]
<pid 0, thread 0>: c[0][0] += a[0][0] * b[0][0]

Debug Statements

<pid 0, thread 0>: c[0][0] += a[0][1] * b[0][1]
<pid 0, thread 0>: c[0][0] += a[0][2] * b[0][2]
<pid 2, thread 0>: c[2][0] += a[2][0] * b[0][0]
<pid 2, thread 0>: c[2][0] += a[2][1] * b[0][1]
<pid 2, thread 0>: c[2][0] += a[2][2] * b[0][2]
<pid 1, thread 0>: c[1][0] += a[1][0] * b[0][0]
<pid 1, thread 0>: c[1][0] += a[1][1] * b[0][1]
<pid 1, thread 0>: c[1][0] += a[1][2] * b[0][2]
<pid 1, thread 2>: c[1][2] += a[1][0] * b[2][0]
<pid 1, thread 2>: c[1][2] += a[1][1] * b[2][1]
<pid 1, thread 2>: c[1][2] += a[1][2] * b[2][2]
<pid 2, thread 2>: c[2][2] += a[2][0] * b[2][0]
<pid 2, thread 2>: c[2][2] += a[2][1] * b[2][1]
<pid 2, thread 2>: c[2][2] += a[2][2] * b[2][2]

Sobel Edge Detection

Given an image, the problem is to detect where the "edges" are in the picture.
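The Sobel operator estimates the brightness gradient at each pixel using two 3x3 masks, one for the horizontal and one for the vertical direction; the true gradient magnitude sqrt(sumx^2 + sumy^2) is replaced by the cheaper approximation |sumx| + |sumy| in the code that follows.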

Sobel Edge Detection

Sobel Edge Detection Algorithm

/* 3x3 Sobel masks. */
GX[0][0] = -1; GX[0][1] = 0; GX[0][2] = 1;
GX[1][0] = -2; GX[1][1] = 0; GX[1][2] = 2;
GX[2][0] = -1; GX[2][1] = 0; GX[2][2] = 1;

GY[0][0] =  1; GY[0][1] =  2; GY[0][2] =  1;
GY[1][0] =  0; GY[1][1] =  0; GY[1][2] =  0;
GY[2][0] = -1; GY[2][1] = -2; GY[2][2] = -1;

for (x = 0; x < N; ++x) {
    for (y = 0; y < N; ++y) {
        sumx = 0;
        sumy = 0;

        // handle image boundaries
        if (x == 0 || x == (h-1) || y == 0 || y == (w-1))
            sum = 0;
        else {

Sobel Edge Detection Algorithm

            for (i = -1; i <= 1; i++) {
                for (j = -1; j <= 1; j++) {
                    // x gradient approx
                    sumx += (grayImage[x+i][y+j] * GX[i+1][j+1]);
                    // y gradient approx
                    sumy += (grayImage[x+i][y+j] * GY[i+1][j+1]);
                }
            }

            // gradient magnitude approx
            sum = (abs(sumx) + abs(sumy));
        }

        edgeImage[x][y] = clamp(sum);
    }
}

There are no loop-carried dependencies. Therefore, this is a Scatter/Gather pattern.
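The clamp() function is not shown on these slides; a minimal sketch, assuming the output image stores 8-bit pixel values:

/* Hypothetical helper (not from the original slides): limit a
   gradient magnitude to the 0..255 range of an 8-bit pixel. */
int clamp(int value)
{
    if (value < 0)
        return 0;
    if (value > 255)
        return 255;
    return value;
}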

Sobel Edge Detection Algorithm

Inputs (that need to be broadcast or scattered):
- GX and GY arrays
- grayImage array
- w and h (width and height)

There are 4 nested loops (x, y, i, and j).

The final answer is the array edgeImage.
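A sketch of declarations consistent with this list (the element types and the fixed dimension N are assumptions; the slides do not show them):

#define N 512                /* assumed image dimension */

int GX[3][3], GY[3][3];      /* Sobel masks */
int grayImage[N][N];         /* input image, broadcast to all processes */
int edgeImage[N][N];         /* result, gathered at the end */
int w, h;                    /* image width and height */
int x, y, i, j;              /* loop indices */
int sumx, sumy, sum;         /* gradient accumulators */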

Hybrid Sobel Edge Detection using Paraguin

#pragma paraguin begin_parallel

/* 3x3 Sobel masks. */
GX[0][0] = -1; GX[0][1] = 0; GX[0][2] = 1;
GX[1][0] = -2; GX[1][1] = 0; GX[1][2] = 2;
GX[2][0] = -1; GX[2][1] = 0; GX[2][2] = 1;

GY[0][0] =  1; GY[0][1] =  2; GY[0][2] =  1;
GY[1][0] =  0; GY[1][1] =  0; GY[1][2] =  0;
GY[2][0] = -1; GY[2][1] = -2; GY[2][2] = -1;

#pragma paraguin bcast grayImage w h

Hybrid Sobel Edge Detection using Paraguin

#pragma paraguin forall
#pragma omp parallel for private(x,y,i,j,sumx,sumy,sum) shared(w,h) num_threads(4)
for (x = 0; x < N; ++x) {
    for (y = 0; y < N; ++y) {
        sumx = 0;
        sumy = 0;
        ...
        edgeImage[x][y] = clamp(sum);
    }
}
;
#pragma paraguin gather edgeImage
#pragma paraguin end_parallel

The x loop is partitioned first among the computers, then a second time among the cores within each computer.

What does not work with Paraguin

Syntax:

#pragma omp parallel
structured_block

Example:

#pragma omp parallel private(tID) num_threads(4)
{
    tID = omp_get_thread_num();
    printf("<pid %d>: tid = %d\n", __guin_rank, tID);
}

Very important: the opening brace must be on a new line.

What does not work with Paraguin

The SUIF compiler removes the braces because they are not associated with a control structure; a #pragma is not a control structure, but rather a pre-processor directive. After compiling with scc:

#pragma omp parallel private(tID) num_threads(4)
tID = omp_get_thread_num();
printf("<pid %d>: tid = %d\n", __guin_rank, tID);

The braces have been removed, so only the first statement remains part of the parallel region.

The Fix

The trick is to put in a control structure that basically does nothing:

dummy = 0;
#pragma omp parallel private(tID) num_threads(4)
if (dummy == 0)
{
    tID = omp_get_thread_num();
    printf("<pid %d>: tid = %d\n", __guin_rank, tID);
}

"if (1)" does not work, because SUIF simplifies away a constant condition. The if statement above is always true at run time, but since the condition involves a variable, this code is left essentially intact.

Result

<pid 1>: tid = 1
<pid 1>: tid = 2
<pid 1>: tid = 3
<pid 2>: tid = 2
<pid 0>: tid = 3
<pid 1>: tid = 0
<pid 2>: tid = 3
<pid 3>: tid = 2
<pid 3>: tid = 1
<pid 0>: tid = 2
<pid 2>: tid = 0
<pid 0>: tid = 1
<pid 3>: tid = 0
<pid 3>: tid = 3
<pid 0>: tid = 0

Questions