Paraguin Compiler Communication.

I/O
Now that we have parallelized the main loop nest of the program, we need to get the input data from a file and distribute it to all the processors. So how do we do this?

I/O
First, we will have only the master thread read the input from a file, because we aren't using parallel file I/O. This is easily done by putting the file I/O outside a parallel region.

I/O

    if (argc < 2) {
        fprintf (stderr, usage, argv[0]);
        return -1;
    }
    if ((fd = fopen (argv[1], "r")) == NULL) {
        fprintf (stderr, "%s: Cannot open file %s for reading.\n",
                 argv[0], argv[1]);
        return -1;
    }
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            fscanf (fd, "%f", &a[i][j]);

    ;
    #pragma paraguin begin_parallel

How do we get the data to the other processors?
MPI provides a number of ways to do this. Since MPI commands can be put directly in the source code, you can still do everything in Paraguin that you can in MPI. For example:

    #ifdef PARAGUIN
        MPI_Scatter(a, blksz, MPI_FLOAT, b, blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);
    #endif
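
For context, a minimal sketch of how such an embedded MPI_Scatter call might be set up inside the parallel region (where the Paraguin-generated code has already initialized MPI). The names np, blksz, and b, and the assumption that N is evenly divisible by the number of processes, are illustrative and not part of Paraguin itself:

    #ifdef PARAGUIN
        int np, blksz;
        MPI_Comm_size(MPI_COMM_WORLD, &np);    /* number of MPI processes       */
        blksz = (N / np) * N;                  /* elements per process; assumes */
                                               /* N is divisible by np          */
        MPI_Scatter(a, blksz, MPI_FLOAT,       /* master sends one block of a   */
                    b, blksz, MPI_FLOAT,       /* each process receives into b  */
                    0, MPI_COMM_WORLD);        /* root (master) is rank 0       */
    #endif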

Paraguin only provides a broadcast
This sends the entire data set to the other processors. Its complexity is O(log2 NP) as opposed to O(NP), where NP is the number of processors. For example:

    ;
    #pragma paraguin bcast a

This should be inside a parallel region.
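
To illustrate the placement, here is a minimal sketch using the pragma forms from these slides and the earlier Paraguin sessions (begin_parallel, end_parallel, bcast); treat it as a template under those assumptions rather than a complete program:

    /* Master reads the input file here, outside the parallel region. */

    ;
    #pragma paraguin begin_parallel

    /* Broadcast the entire array a to every process (O(log2 NP) steps). */
    ;
    #pragma paraguin bcast a

    /* ... parallel computation that reads a ... */

    ;
    #pragma paraguin end_parallel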

Gather
Paraguin does provide a mechanism for gather; however, it does not use MPI_Gather and isn't as efficient: Paraguin sends an individual message from each processor back to the master thread. To use the Paraguin gather, you need to specify which array accesses and values are the final values.

Syntax:

    #pragma paraguin gather <array reference> <matrix>

where <matrix> is a system of inequalities that identifies which iterations produce the final values.

Example, for array reference 0 and the iterations where i = j - 1:

    #pragma paraguin gather 0x0 C i j k \
        -1 -1  1 0x0 \
         1  1 -1 0x0

So what is the array reference? Array references are enumerated starting at 0. The iterations where i = j - 1 are the ones in which the value of the array is written and not changed again (the final value).

Consider the elimination step of Gaussian elimination:

    for (i = 1; i <= N; i++)
        for (j = i+1; j <= N; j++)
            for (k = N+1; k >= i; k--)
                a[j][k] = a[j][k] - a[i][k] * a[j][i] / a[i][i];

The five array references in the assignment are enumerated in the order they appear: the write a[j][k] is reference 0, and the reads a[j][k], a[i][k], a[j][i], and a[i][i] are references 1, 2, 3, and 4.

Still You Can Use MPI Commands
Remember, you can still put in MPI commands directly:

    #ifdef PARAGUIN
        MPI_Gather(a, blksz, MPI_FLOAT, b, blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);
    #endif

Dependency and Communication
If there is a data dependence across iterations (a loop-carried dependence), then this data must be transmitted from the processor that computes the value to the processor that needs it. Paraguin can generate the MPI code to perform the communication, but you need to tell it about the dependence.

Data Dependency
Consider the elimination step of Gaussian elimination again:

    for (i = 1; i <= N; i++)
        for (j = i+1; j <= N; j++)
            for (k = N+1; k >= i; k--)
                a[j][k] = a[j][k] - a[i][k] * a[j][i] / a[i][i];

Here the dependence of interest is between reference 0 (the write to a[j][k]) and reference 2 (the read of a[i][k]).

Data Dependency
This is called a loop-carried dependence.

    <i, j, k>
    <1, 2, 5>: a[2][5] = a[2][5] - a[1][5] * a[2][1] / a[1][1]
    <1, 2, 4>: a[2][4] = a[2][4] - a[1][4] * a[2][1] / a[1][1]
    <1, 2, 3>: a[2][3] = a[2][3] - a[1][3] * a[2][1] / a[1][1]
    <1, 2, 2>: a[2][2] = a[2][2] - a[1][2] * a[2][1] / a[1][1]
    <1, 2, 1>: a[2][1] = a[2][1] - a[1][1] * a[2][1] / a[1][1]
    ...
    <2, 3, 5>: a[3][5] = a[3][5] - a[2][5] * a[3][2] / a[2][2]
    <2, 3, 4>: a[3][4] = a[3][4] - a[2][4] * a[3][2] / a[2][2]
    <2, 3, 3>: a[3][3] = a[3][3] - a[2][3] * a[3][2] / a[2][2]
    <2, 3, 2>: a[3][2] = a[3][2] - a[2][2] * a[3][2] / a[2][2]

The values a[2][k] written in iterations <1, 2, k> are read again in iterations <2, 3, k>, so the dependence is carried across iterations of the outer loops.

Data Dependency
So we need to tell Paraguin which array references depend on which references, and the mapping of the read iteration instance to the write iteration instance. The format for a dependence pragma is:

    #pragma paraguin dep <write array reference> <read array reference> <matrix>

where <matrix> is a system of inequalities that maps the read iteration instance to the write iteration instance.

Data Dependency
We are going to skip the details of specifying loop-carried data dependences. Why? Because it is complicated, AND ...

Results from Gaussian Elimination

I have been trying to figure out how to get better performance on problems like LU decomposition and Gaussian elimination. Even with the progress made, we still can't get speedup when there is inter-processor communication from loop-carried dependences.

“Until inter-processor communication latency is at least as fast as memory access latency, we won't achieve speedup on a distributed-memory system for problems that require inter-process communication beyond scattering the input data and gathering the partial results.” [Ferner:2012]

So what good are distributed-memory systems?
Question: What can we run on distributed-memory systems and achieve speedup?
Answer: Parallel programs that do not need inter-processor communication beyond the scatter and gather process. In other words: "embarrassingly parallel" applications.

Communication Patterns Like This
After the scattering of input data and before the gathering of partial results, the processors work independently.

[Diagram: the master scatters the input to processors 1 through 7, each processor computes independently, and the partial results are gathered back to the master.]
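
A minimal MPI-only sketch of this scatter/compute/gather pattern; the matrix size, the doubling "work" loop, and the assumption that N is divisible by the number of processes are placeholders for whatever each processor actually computes independently:

    #include <mpi.h>
    #include <stdio.h>

    #define N 512
    #define ROOT 0

    float a[N][N], block[N][N], result[N][N];

    int main(int argc, char *argv[])
    {
        int rank, np;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        int rows = N / np;                  /* rows per process (assumes N % np == 0) */

        if (rank == ROOT) {
            /* master reads or generates the input matrix a here */
        }

        /* Scatter: each process receives 'rows' rows of a. */
        MPI_Scatter(a, rows * N, MPI_FLOAT, block, rows * N, MPI_FLOAT,
                    ROOT, MPI_COMM_WORLD);

        /* Independent computation: no inter-processor communication. */
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < N; j++)
                block[i][j] = 2.0f * block[i][j];   /* placeholder work */

        /* Gather the partial results back to the master. */
        MPI_Gather(block, rows * N, MPI_FLOAT, result, rows * N, MPI_FLOAT,
                   ROOT, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }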

Examples
- Matrix multiplication (the obvious algorithm)
- Sobel edge detection
- Monte Carlo algorithms
- Traveling Salesman Problem
- Several SPEC benchmark algorithms
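
As a concrete instance of the Monte Carlo entry above, here is a hypothetical sketch of estimating pi in parallel; each process draws its own samples, and the only communication is a single reduction of the hit counts (rand() is used only to keep the sketch short, not as a serious parallel random number generator):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        long samples = 10000000, local_hits = 0, total_hits = 0;
        int rank, np;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        srand(rank + 1);                      /* give each process a different seed */
        long local_samples = samples / np;
        for (long i = 0; i < local_samples; i++) {
            double x = (double) rand() / RAND_MAX;
            double y = (double) rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)
                local_hits++;                 /* point fell inside the quarter circle */
        }

        /* The only communication: sum the partial counts on the master. */
        MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM,
                   0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("pi is approximately %f\n",
                   4.0 * total_hits / (double)(local_samples * np));

        MPI_Finalize();
        return 0;
    }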