Parallel Computing 2007: Performance Analysis for Laplace's Equation (PC07LaplacePerformance)
February 26 - March 1, 2007
Geoffrey Fox, Community Grids Laboratory, Indiana University, 505 N Morton Suite 224, Bloomington IN

Abstract of Parallel Programming for Laplace's Equation
This takes Jacobi iteration for Laplace's equation in a 2D square and uses it to illustrate:
–Programming in both Data Parallel (HPF) and Message Passing (MPI and a simplified syntax) styles
–The SPMD (Single Program Multiple Data) programming model
–Stencil dependence of the parallel program and the use of guard rings
–Collective communication
–Basic speedup, efficiency and performance analysis with edge-over-area dependence, including load imbalance and communication overhead effects

Potential in a Vacuum-Filled Rectangular Box
So imagine the world's simplest PDE problem:
–Find the electrostatic potential inside a box whose sides are held at given potentials
–Set up a 16 by 16 grid on which the potential is defined and must satisfy Laplace's equation

Basic Sequential Algorithm
Initialize the internal 14 by 14 grid to anything you like and then apply, for ever:
Φ_New = (Φ_Left + Φ_Right + Φ_Up + Φ_Down) / 4
(Figure: the five-point update stencil showing Φ_Up, Φ_Down, Φ_Left, Φ_Right and Φ_New.)
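The deck's actual code is in Fortran and appears only as images in this transcript, so as an illustration here is a minimal C sketch of one Jacobi sweep on the 16 by 16 array; the function name jacobi_sweep and the use of a temporary array are our own choices, not taken from the slides.

```c
#define N 16   /* 16 by 16 grid including the fixed boundary ring */

/* One Jacobi sweep: rows/columns 0 and N-1 hold boundary values and are
   never touched; only the 14 by 14 interior is updated. */
void jacobi_sweep(double phi[N][N]) {
    double phinew[N][N];
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            phinew[i][j] = 0.25 * (phi[i-1][j] + phi[i+1][j] +
                                   phi[i][j-1] + phi[i][j+1]);
    /* then replace the old interior values by the new ones */
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            phi[i][j] = phinew[i][j];
}
```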

Update on the Grid
(Figure: the 14 by 14 internal grid of points being updated.)

Parallelism is Straightforward
If one has 16 processors, decompose the geometrical area into 16 equal parts
Each processor then updates 9, 12 or 16 grid points independently (corner, edge and interior processors respectively)

Communication is Needed
Updating the edge points owned by a processor requires values communicated from the neighboring processors
For instance, the processor holding the green points requires the red points (figure)

Red points on the edge are known boundary values of Φ
Red points in circles are communicated by the neighboring processor
In the update of a processor's points, communicated points and boundary-value points play the same role

Sequential Programming for Laplace I
We give the Fortran version of this; C versions are available at npac.org/projects/cdroms/cewes vol1/cps615course/index.html
If you are not familiar with Fortran:
–AMAX1 calculates the maximum value of its arguments
–DO n starts a loop ended by the statement labeled n
–The rest is obvious from the "English" meaning of each Fortran command
We start with the one-dimensional Laplace equation
–d²Φ/dx² = 0 for a ≤ x ≤ b, with known values of Φ at the end-points x=a and x=b
–But we also give the sequential two-dimensional version

Sequential Programming for Laplace II
We only discuss detailed parallel code for the one-dimensional case; the online resource has three versions of the two-dimensional case
–The second version, using MPI_SENDRECV, is most similar to the discussion here
This particular ordering of updates is called Jacobi's method; we will discuss different orderings later
In one dimension we apply Φ_new(x) = 0.5*(Φ_old(x_left) + Φ_old(x_right)) for all grid points x, and then replace Φ_old(x) by Φ_new(x)

One Dimensional Laplace's Equation
(Figure: grid points 1 to NTOT, with a typical grid point x and its left and right neighbors.)

Sequential Programming with Guard Rings
The concept of guard rings/points ("halos") is well known in the sequential case, where our trivial one-dimensional example (shown above) has 16 points; the end points are fixed boundary values
One could save space, dimension PHI(14), and pick up boundary values with statements inside the I=1,14 loop like
–IF (I.EQ.1) Valueonleft = BOUNDARYLEFT
–ELSE Valueonleft = PHI(I-1) etc.
But this is slower and clumsier to program because of the conditionals INSIDE the loop; instead one dimensions PHI(16), storing the boundary values in PHI(1) and PHI(16)
–Updates are performed as DO I = 2,15
–and, without any test, Valueonleft = PHI(I-1)
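A minimal C sketch of the same guard-point idea (our own illustration; the deck's code is Fortran with PHI(16)): the boundary values live in elements 0 and NTOT-1, so the update loop needs no conditionals.

```c
#define NTOT 16   /* 14 interior points plus the two fixed end values */

void jacobi_sweep_1d(double phi[NTOT]) {
    double phinew[NTOT];
    /* phi[0] and phi[NTOT-1] are boundary values; no IF inside the loop */
    for (int i = 1; i < NTOT - 1; i++)
        phinew[i] = 0.5 * (phi[i-1] + phi[i+1]);
    for (int i = 1; i < NTOT - 1; i++)
        phi[i] = phinew[i];
}
```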

Sequential Guard Rings in Two Dimensions
In the analogous 2D sequential case, one could dimension the array PHI as PHI(14,14) to hold only the updated points; however, points on the edge would then need special treatment to pick up boundary values in the update
Rather, dimension PHI(16,16) to include both internal and boundary points
Run the loops over x(I) and y(J) from 2 to 15 to cover only the internal points
Preload the boundary values into PHI(1,.), PHI(16,.), PHI(.,1) and PHI(.,16)
This is easier and faster as there are no conditionals (IF statements) in the inner loops

Parallel Guard Rings in One Dimension
Now we decompose our 16 points (a trivial example) into four groups and put 4 points in each processor
Instead of dimensioning PHI(4) in each processor, one dimensions PHI(6) and runs the loops from 2 to 5, with either boundary values or communication setting the values of the two end (guard) points
(Figure: the sequential array compared with the parallel PHI(6) for Processor 1.)

Summary of Parallel Guard Rings in One Dimension
(Figure) In the bi-colored points, the upper color is the "owning" processor and the lower color is that of the processor that needs the value to update a neighboring point
Owned by Green -- needed by Yellow
Owned by Yellow -- needed by Green

Setup of Parallel Jacobi in One Dimension
(Figure: the grid distributed over Processors 0 to 3, showing local indices 1, 2 (I1), ..., NLOC+1, NLOC+2 in each processor and the fixed boundary points at the two ends.)
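In C indexing (our convention; the deck's Fortran stores the owned points in 2..NLOC+1), the per-processor layout and update might look like the sketch below. The loop bounds i1, i2 are normally 1 and nloc, but the end processors shrink them so the fixed boundary values are never overwritten, which is presumably the role of the I1 label in the figure.

```c
/* Each processor stores its nloc owned points in phi[1..nloc];
   phi[0] and phi[nloc+1] are guard points filled either from the fixed
   boundary values or by communication with the neighbouring processors. */
void local_sweep(const double *phi, double *phinew, int i1, int i2) {
    for (int i = i1; i <= i2; i++)
        phinew[i] = 0.5 * (phi[i-1] + phi[i+1]);
}
```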

Blocking SEND Problems I
(Figure: processors 0, 1, 2, ..., NPROC-2, NPROC-1 all call SEND first, followed by RECV.)  BAD!!
This is bad because processor 1 cannot call RECV until its SEND completes, but processor 2 will only call RECV (and so complete the SEND from 1) when its own SEND completes, and so on
–A "race" condition which is inefficient and often hangs

Blocking SEND Problems II
(Figure: in the first step alternate processors 0, 1, 2, ..., NPROC-2, NPROC-1 call SEND while their neighbors call RECV; in the second step the roles are reversed.)  GOOD!!
This is good whatever the implementation of SEND, and so is a safe and recommended way to program (see the MPI_Sendrecv sketch below)
If SEND returns as soon as a buffer in the receiving node accepts the message, then the naïve version also works
–Buffered messaging is safer but costs performance, as there is more copying of data
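A hedged sketch of this safe exchange written with MPI_Sendrecv, in the spirit of the MPI_SENDRECV version mentioned earlier (our code, not the deck's): each processor sends its outermost owned points to its neighbours and receives their values into its guard points, with MPI_PROC_NULL turning the calls into no-ops at the ends of the processor line.

```c
#include <mpi.h>

/* Fill the guard points phi[0] and phi[nloc+1] from the left and right
   neighbours; safe regardless of how the blocking SEND is implemented. */
void exchange_guards(double *phi, int nloc, MPI_Comm comm) {
    int rank, nproc;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nproc);
    int left  = (rank == 0)         ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nproc - 1) ? MPI_PROC_NULL : rank + 1;

    /* send last owned point right, receive left guard from the left */
    MPI_Sendrecv(&phi[nloc], 1, MPI_DOUBLE, right, 0,
                 &phi[0],    1, MPI_DOUBLE, left,  0,
                 comm, MPI_STATUS_IGNORE);
    /* send first owned point left, receive right guard from the right */
    MPI_Sendrecv(&phi[1],        1, MPI_DOUBLE, left,  1,
                 &phi[nloc + 1], 1, MPI_DOUBLE, right, 1,
                 comm, MPI_STATUS_IGNORE);
}
```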

Built-in Implementation of MPSHIFT in MPI
(Figure-only slide.)

How not to Find the Maximum
One could calculate the global maximum by: each processor calculates the maximum inside its node, then
–Processor 1 sends its maximum to node 0
–Processor 2 sends its maximum to node 0
–....
–Processor NPROC-2 sends its maximum to node 0
–Processor NPROC-1 sends its maximum to node 0
The RECVs on processor 0 are sequential
Processor 0 calculates the maximum of its own number and the NPROC-1 received numbers, then
–Processor 0 sends the maximum to node 1
–Processor 0 sends the maximum to node 2
–....
–Processor 0 sends the maximum to node NPROC-2
–Processor 0 sends the maximum to node NPROC-1
This is correct, but the total time is proportional to NPROC and does NOT scale

How to Find the Maximum Better
One can calculate the global maximum better: each processor calculates the maximum inside its node, then divide the processors into a logical tree and, in parallel,
–Processor 1 sends its maximum to node 0
–Processor 3 sends its maximum to node 2
–....
–Processor NPROC-1 sends its maximum to node NPROC-2
Processors 0, 2, 4 ... NPROC-2 form the resulting maxima in their nodes, then
–Processor 2 sends its maximum to node 0
–Processor 6 sends its maximum to node 4
–....
–Processor NPROC-2 sends its maximum to node NPROC-4
Repeat this log2(NPROC) times
This is still correct, but the total time is proportional to log2(NPROC) and scales much better (see the MPI collective sketch below)
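In practice the whole tree is available as a single MPI collective; a hedged example (our code) using MPI_Allreduce with MPI_MAX, which combines the local maxima in roughly log2(NPROC) steps and leaves the result on every processor.

```c
#include <mpi.h>

/* Each processor passes in the maximum over its own grid points and
   gets back the global maximum, e.g. for a convergence test. */
double global_max(double local_max, MPI_Comm comm) {
    double result;
    MPI_Allreduce(&local_max, &result, 1, MPI_DOUBLE, MPI_MAX, comm);
    return result;
}
```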

Comments on the "Nifty Maximum Algorithm"
There is a very large computer science literature on this type of algorithm for finding global quantities, optimized for different inter-node communication architectures
One uses these for swapping information, broadcasting and global sums as well as maxima
Often one does not have the "best" algorithm in the installed MPI
Note that in the real world this type of algorithm is used:
–If a University President wants the average student grade, she does not ask each student to send their grade and add them up herself; rather she asks the schools/colleges, who ask the departments, who ask the courses, who do it student by student ....
–Similarly, in voting you count by voter, then by polling station, by county and then by state!

Structure of Laplace Example
We use this example to illustrate some very important general features of parallel programming:
–Load imbalance
–Communication cost
–SPMD
–Guard rings
–Speedup
–The ratio of communication to computation time

Sequential Guard Rings in Two Dimensions (repeated)
In the analogous 2D sequential case, one could dimension the array PHI as PHI(14,14) to hold only the updated points; however, points on the edge would then need special treatment to pick up boundary values in the update
Rather, dimension PHI(16,16) to include both internal and boundary points
Run the loops over x(I) and y(J) from 2 to 15 to cover only the internal points
Preload the boundary values into PHI(1,.), PHI(16,.), PHI(.,1) and PHI(.,16)
This is easier and faster as there are no conditionals (IF statements) in the inner loops

Parallel Guard Rings in Two Dimensions I
This is just like the one-dimensional case
First we decompose the problem as we have seen; four processors are shown (figure)

Parallel Guard Rings in Two Dimensions II
Now look at the processor at the top left
It needs the real boundary values for its updates, shown as black and green
It also needs points from the neighboring processors, shown hatched in green plus the other processor's color

Parallel Guard Rings in Two Dimensions III
Now we see the effect of all the guards, with the four points at the center needed by 3 processors and the other shaded points by 2
One dimensions the overlapping grids PHI(10,10) here and arranges the communication order properly
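A sketch of the corresponding 2D storage for the four-processor example (our C indexing): each processor owns an 8 by 8 block of the 16 by 16 grid and dimensions its local array 10 by 10, matching the deck's PHI(10,10), so that row/column 0 and NLOC+1 hold the boundary or communicated guard values. As in one dimension, the loop bounds shrink on processors that own fixed boundary points.

```c
#define NLOC 8   /* owned block size; local array is (NLOC+2) x (NLOC+2) */

void local_sweep_2d(double phi[NLOC + 2][NLOC + 2],
                    double phinew[NLOC + 2][NLOC + 2],
                    int i1, int i2, int j1, int j2) {
    for (int i = i1; i <= i2; i++)      /* normally 1..NLOC */
        for (int j = j1; j <= j2; j++)
            phinew[i][j] = 0.25 * (phi[i-1][j] + phi[i+1][j] +
                                   phi[i][j-1] + phi[i][j+1]);
}
```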

Performance Analysis Parameters
The analysis will only depend on 3 parameters:
–n, the grain size: the amount of the problem stored on each processor (bounded by local memory)
–t_float, the typical time to do one calculation on one node
–t_comm, the typical time to communicate one word between two nodes
The most important omission here is communication latency: Time to communicate = t_latency + (Number of words) × t_comm
(Figure: two nodes A and B, each with a CPU (t_float) and memory holding n points, connected by a link characterized by t_comm.)

Analytical analysis of Load Imbalance
Consider an N by N array of grid points on P processors, where √P is an integer and the processors are arranged in a √P by √P topology
Suppose N is exactly divisible by √P, so a general processor has a grain size of n = N²/P grid points
Sequential time T_1 = (N-2)² t_calc
Parallel time T_P = n t_calc
Speedup S = T_1/T_P = P(1 - 2/N)² = P(1 - 2/√(nP))²
–for example, the 16 by 16 grid on 16 processors gives S = 16(1 - 2/16)² = 12.25
S tends to P as N gets large at fixed P
This expresses analytically the intuitive idea that load imbalance is due to boundary effects and goes away for large N

Example of Communication Overhead
The largest communication load is communicating 16 words, to be compared with calculating 16 updates, each taking time t_calc
Each communication is one value of Φ, probably stored in a 4-byte word, and takes time t_comm
Then on 16 processors, T_16 = 16 t_calc + 16 t_comm
Speedup S = T_1/T_16 = 12.25 / (1 + t_comm/t_calc)
or, with t_calc = 4 t_float, S = 12.25 / (1 + 0.25 t_comm/t_float)
or S ≈ 12.25 (1 - 0.25 t_comm/t_float)

Communication Must be Reduced
4 by 4 regions in each processor
–16 green (compute) and 16 red (communicate) points
8 by 8 regions in each processor
–64 green and "just" 32 red points
Communication is an edge effect
Give each processor plenty of memory and increase the region in each machine
Large problems parallelize best

General Analytical Form of Communication Overhead for Jacobi
Consider an N by N grid of points on P processors with grain size n = N²/P
Sequential time T_1 = 4N² t_float
Parallel time T_P = 4n t_float + 4√n t_comm
Speedup S = P(1 - 2/N)² / (1 + t_comm/(√n t_float))
–the factor (1 - 2/N)² is the load imbalance and the term t_comm/(√n t_float) in the denominator is the communication overhead
Both overheads decrease like 1/√n as n increases
This ignores communication latency but is otherwise accurate
Speedup is reduced from P by both overheads
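For reference, the speedup line follows directly from the two times above; a short derivation under the slide's model, with T_1 counted over the (N-2)² interior points (which is where the load-imbalance factor comes from):

```latex
T_1 = 4(N-2)^2\,t_{\text{float}},\qquad
T_P = 4n\,t_{\text{float}} + 4\sqrt{n}\,t_{\text{comm}},\qquad n = N^2/P
\;\Longrightarrow\;
S = \frac{T_1}{T_P}
  = \frac{P\,(1-2/N)^2}{1 + t_{\text{comm}}/(\sqrt{n}\,t_{\text{float}})}
```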

General Speed Up and Efficiency Analysis I
Efficiency ε = Speedup S / P (number of processors)
Overhead f_comm = (P T_P - T_1) / T_1 = 1/ε - 1
As f_comm is linear in T_P, overhead effects tend to be additive
In the 2D Jacobi example, f_comm = t_comm/(√n t_float)
while the efficiency takes the approximate form ε ≈ 1 - t_comm/(√n t_float), valid when the overhead is small
As expected, the efficiency is < 1, corresponding to the speedup being < P
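Spelling out the relation between efficiency and overhead used above (our rearrangement of the slide's definitions):

```latex
\varepsilon = \frac{S}{P} = \frac{T_1}{P\,T_P},\qquad
f_{\text{comm}} = \frac{P\,T_P - T_1}{T_1} = \frac{1}{\varepsilon} - 1
\;\Longrightarrow\;
\varepsilon = \frac{1}{1 + f_{\text{comm}}} \approx 1 - f_{\text{comm}}
\quad\text{for small } f_{\text{comm}}
```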

All systems have various Dimensions
(Figure-only slide.)

General Speed Up and Efficiency Analysis II
In many problems there is an elegant formula f_comm = constant · t_comm/(n^(1/d) t_float)
d is the system information dimension, which equals the geometric dimension in problems like Jacobi where communication is a surface effect and calculation a volume effect
–We will soon see a case where d is NOT the geometric dimension
d=1 for Hadrian's Wall and d=2 for Hadrian's Palace floor, while for Jacobi in 1, 2 or 3 dimensions, d = 1, 2 or 3
Note the formula depends only on local node and communication parameters; this implies that parallel computing does scale to large P if you build fast enough networks (t_comm/t_float) and have a large enough problem (big n)
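Because the formula depends only on n, d and the hardware ratio t_comm/t_float, it is easy to tabulate; a small self-contained C sketch (ours, with the constant taken as 1 as in the simple Jacobi stencil and an assumed t_comm/t_float of 10) printing the predicted overhead and efficiency for several grain sizes:

```c
#include <math.h>
#include <stdio.h>

/* Evaluate f_comm = t_comm / (n^(1/d) * t_float) and the corresponding
   efficiency 1/(1 + f_comm); latency is ignored, as in the slides. */
int main(void) {
    const double ratio = 10.0;   /* assumed t_comm / t_float */
    const int d = 2;             /* information dimension for 2D Jacobi */
    const double grains[] = {16, 64, 256, 1024, 4096};
    for (int k = 0; k < 5; k++) {
        double fcomm = ratio / pow(grains[k], 1.0 / d);
        printf("n = %6.0f  f_comm = %.3f  efficiency = %.3f\n",
               grains[k], fcomm, 1.0 / (1.0 + fcomm));
    }
    return 0;
}
```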

Communication to Calculation Ratio as a Function of Template I
For Jacobi we have
–Calculation: 4 n t_float
–Communication: 4 √n t_comm
–Communication overhead: f_comm = t_comm/(√n t_float)
"Smallest" communication, but NOT the smallest overhead
(Figure: the five-point update stencil, with communicated and updated points and the processor boundaries.)

Communication to Calculation Ratio as a Function of Template II
For Jacobi with fourth-order differencing we have
–Calculation: 9 n t_float
–Communication: 8 √n t_comm
–Communication overhead: f_comm = 0.89 t_comm/(√n t_float)
A little bit smaller, as communication and computation have both roughly doubled
(Figure: the fourth-order stencil, with communicated and updated points and the processor boundaries.)

Communication to Calculation Ratio as a Function of Template III
For Jacobi with diagonal neighbors we have
–Calculation: 9 n t_float
–Communication: 4(√n + 1) t_comm
–Communication overhead: f_comm ≈ 0.5 t_comm/(√n t_float)
Quite a bit smaller
(Figure: the nine-point stencil including diagonal neighbors, with communicated and updated points and the processor boundaries.)

Communication to Calculation IV
Now systematically increase the size of the stencil; you get this in particle dynamics problems as you increase the range of the force
Calculation per point increases faster than communication, so f_comm decreases systematically
You must re-use the communicated values for this to work!
(Figure: update stencils of increasing range.)

Communication to Calculation V
Now make the range cover the full domain, as in long-range force problems
f_comm ∝ t_comm/(n t_float)
This is a case with geometric dimension 1, 2 or 3 (depending on the space the particles live in) but information dimension always 1

Butterfly Pattern in the 1D DIT FFT
DIT is Decimation in Time
The data dependencies in a 1D FFT show a characteristic butterfly structure
We show the 4 processing phases p for a 16-point FFT
At phase p of DIT, one manipulates the p'th binary digit of the index m of f(m)
(Figure: the butterfly diagram, with index m labeled in binary and the phases along the other axis.)

Phase Parallelism in the 1D FFT
Consider 4 processors and the natural block decomposition
There are log2 N phases, with the first log2 Nproc of the phases having communication
Here phases 2 and 3 have communication, as their dependency lines cross processor boundaries
There is a better algorithm that transposes the data to do the first log2 Nproc phases locally
(Figure: the 16-point butterfly with the processor boundaries and processor numbers marked.)

Parallel Fast Fourier Transform
We have the standard formula f_comm = constant · t_comm/(n^(1/d) t_float)
The FFT does not have the usual local interaction but a butterfly pattern, so n^(1/d) becomes log2 n, corresponding to d = infinity
Below, T_comm is the time to exchange one complex word (in a block transfer!) between pairs of processors
P = log2 Nproc, N = total number of points, k = log2 N
(Formula: the parallel FFT time, expressed in terms of k, the arithmetic times 2T_+ + T_*, and T_comm.)

Better Algorithm
If we change the algorithm to move the "out of processor" bits to be in-processor, do their transformation, and then move those "digits" back, an interesting point about the resulting communication overhead is that one no longer gets the log2 Nproc dependence in the numerator
See npac.org/users/fox/presentations/cps615fft00/cps615fft00.PPT

(Figure-only slide: the resulting expression for f_comm.)
