12e.1 More on Parallel Computing UNC-Wilmington, C. Ferner, Mar 21, 2007

12e.2 Block Mapping (Review)

    blksz = (int)ceil((float)N / P);
    for (i = lb + my_rank * blksz;
         i < min(N, lb + (my_rank + 1) * blksz); i++) {
        ...
    }

(lb is the lower bound of the original loop)
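For context, here is a minimal, self-contained sketch of how the block-mapped loop above might sit inside a complete MPI program. The values of N and lb, the min macro, and the printf body are illustrative assumptions, not part of the slide.

    /* Sketch: block mapping of N iterations across P MPI ranks.
       N, lb, and the per-iteration work are placeholders. */
    #include <stdio.h>
    #include <math.h>
    #include <mpi.h>

    #define min(a, b) ((a) < (b) ? (a) : (b))

    int main(int argc, char *argv[])
    {
        int my_rank, P;
        int N = 26, lb = 0;   /* original loop: for (i = lb; i < N; i++) */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        int blksz = (int)ceil((float)N / P);
        for (int i = lb + my_rank * blksz;
             i < min(N, lb + (my_rank + 1) * blksz); i++) {
            printf("rank %d executes iteration %d\n", my_rank, i);
        }

        MPI_Finalize();
        return 0;
    }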

12e.3 Example

    for (i = 1; i < N; i++) {
        for (j = 0; j < N; j++) {
            a[i][j] += f(a[i-1][j]);
        }
    }

12e.4 Example [Figure: the N x N iteration space, with i indexing rows and j indexing columns; each cell (i,j) is computed from the cell (i-1,j) directly above it.]

12e.5 Example If we mapped iterations of the i loop to processors, the dependencies would cross processor boundaries. Therefore, interprocessor communication would be required.

12e.6 Example [Figure: the iteration space with rows (iterations of the i loop) assigned to PE 0, PE 1, PE 2, ..., PE P; the column dependencies cross the processor boundaries.]

12e.7 Example A better solution would be to map iterations of the j loop to processors.

12e.8 Example [Figure: the iteration space with columns (iterations of the j loop) assigned block-wise to PE 0, PE 1, PE 2, and PE 3; each column's dependence chain stays within a single processor.]

12e.9 Example

    for (i = 1; i < N; i++) {
        for (j = my_rank * blksz; j < min(N, (my_rank + 1) * blksz); j++) {
            a[i][j] += f(a[i-1][j]);
        }
    }
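A hedged sketch of how this j-mapped loop could be embedded in a full MPI program follows. The array size N, the placeholder f(), and the choice to let every rank allocate the whole array (rather than just its column block) are assumptions made purely for illustration; the point is that no messages are needed because each column's dependence chain stays on one rank.

    /* Sketch: the j loop is block-mapped so each rank owns a contiguous
       group of columns.  Because a[i][j] depends only on a[i-1][j] (same
       column), no interprocessor communication is needed. */
    #include <math.h>
    #include <mpi.h>

    #define N 8
    #define min(a, b) ((a) < (b) ? (a) : (b))

    static double f(double x) { return 0.5 * x; }   /* placeholder for f() */

    int main(int argc, char *argv[])
    {
        static double a[N][N];                      /* assume initialized elsewhere */
        int my_rank, P;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        int blksz = (int)ceil((float)N / P);
        for (int i = 1; i < N; i++)
            for (int j = my_rank * blksz; j < min(N, (my_rank + 1) * blksz); j++)
                a[i][j] += f(a[i - 1][j]);

        MPI_Finalize();
        return 0;
    }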

12e.10 Block Mapping (Review)

    blksz = (int)ceil((float)N / P);
    for (i = lb + my_rank * blksz;
         i < min(N, lb + (my_rank + 1) * blksz); i++) {
        ...
    }

(lb is the lower bound of the original loop)

12e.11 Block Mapping

12e.12 Block Mapping The problem is that block mapping can lead to a load imbalance. Example: let N = 26 and P = 6, so blksz = ceiling(26/6) = 5 (lb = 0).

12e.13 Block Mapping Processors 0-4 have 5 iterations of work each; processor 5 has only 1 iteration.
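The imbalance is easy to verify with a small stand-alone check (plain C written for this example; it is not part of the slides):

    /* Count how many iterations each processor receives under block
       mapping with N = 26, P = 6, lb = 0. */
    #include <stdio.h>
    #include <math.h>

    #define min(a, b) ((a) < (b) ? (a) : (b))

    int main(void)
    {
        int N = 26, P = 6, lb = 0;
        int blksz = (int)ceil((float)N / P);        /* = 5 */

        for (int rank = 0; rank < P; rank++) {
            int start = lb + rank * blksz;
            int end   = min(N, lb + (rank + 1) * blksz);
            int count = end > start ? end - start : 0;
            printf("processor %d: %d iterations\n", rank, count);
        }
        return 0;
    }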

12e.14 Cyclic Mapping An alternative to block mapping is cyclic mapping, where iterations are assigned to processors in a round-robin fashion. This leads to a better load balance.

12e.15 Cyclic Mapping With N = 26 and P = 6, processors 0 and 1 have 5 iterations of work; processors 2-5 have only 4, but that is only 1 iteration fewer!

12e.16 Cyclic Mapping

    for (i = lb + my_rank; i < N; i += P) {
        ...
    }

(lb is the lower bound of the original loop)
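A small stand-alone check (again, illustrative code rather than part of the slides) lists which iterations each processor receives under cyclic mapping for the earlier N = 26, P = 6 example; the per-processor counts differ by at most one:

    /* List the iterations each processor receives under cyclic mapping
       with N = 26, P = 6, lb = 0. */
    #include <stdio.h>

    int main(void)
    {
        int N = 26, P = 6, lb = 0;

        for (int rank = 0; rank < P; rank++) {
            int count = 0;
            printf("processor %d:", rank);
            for (int i = lb + rank; i < N; i += P) {
                printf(" %d", i);
                count++;
            }
            printf("  (%d iterations)\n", count);
        }
        return 0;
    }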

12e.17 Cyclic Mapping Conceptually, this is an easier mapping to implement than block mapping, and it leads to better load balancing. However, it can (and often does) lead to more communication. Suppose that each iteration in the above example is dependent on the previous iteration.

12e.18 Cyclic Mapping With cyclic mapping, consecutive iterations are on different processors, so a message is sent from iteration 0 to 1, from 1 to 2, from 2 to 3, from 3 to 4, from 4 to 5, from 5 to 6, and so on.

12e.19 Block Mapping With block mapping, messages are sent only at block boundaries: from iteration 5 to 6, from 11 to 12, from 17 to 18, and from 23 to 24.
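One way to picture this is a pipeline: with block mapping, each rank only needs the last value of the previous rank's block. The sketch below uses an assumed kernel (a single running value updated once per iteration) to show the communication pattern; it is not code from the slides, and only P-1 messages are sent in total.

    /* Sketch: iteration i depends on iteration i-1.  With block mapping,
       each rank receives one boundary value from the previous rank and
       sends one to the next, so P-1 messages are sent overall. */
    #include <math.h>
    #include <mpi.h>

    #define min(a, b) ((a) < (b) ? (a) : (b))

    int main(int argc, char *argv[])
    {
        int my_rank, P, N = 26, lb = 0;
        double prev = 0.0;                  /* value carried across iterations */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        int blksz = (int)ceil((float)N / P);
        int start = lb + my_rank * blksz;
        int end   = min(N, lb + (my_rank + 1) * blksz);

        if (my_rank > 0)                    /* boundary value from the previous block */
            MPI_Recv(&prev, 1, MPI_DOUBLE, my_rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = start; i < end; i++)
            prev = prev + 1.0;              /* placeholder for work that uses x[i-1] */

        if (my_rank < P - 1)                /* pass the block's last value forward */
            MPI_Send(&prev, 1, MPI_DOUBLE, my_rank + 1, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }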

12e.20 Block vs Cyclic Block mapping increases the granularity and reduces overall communication (O(P)). However, it can lead to load imbalances (O(N/P)). Cyclic mapping decreases granularity and increases overall communication (O(N)). However, it improves load balance (O(1)). Block-Cyclic is a combination of the two.

12e.21 Block-Cyclic Mapping Block-cyclic with N = 26, P = 6, and blksz = 2. The load imbalance will be <= blksz.

12e.22 Block-Cyclic Mapping (N, P, and blksz are given)

    nLayers = (int)ceil(((float)N)/(blksz*P));
    for (layer = 0; layer < nLayers; layer++) {
        beginBlk = layer*blksz*P;
        for (i = beginBlk + mypid*blksz;
             i < min(N, beginBlk + (mypid + 1)*blksz); i++) {
            ...
        }
    }
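Run as a stand-alone check (illustrative code, not from the slides), the same loop structure prints which iterations each processor receives for N = 26, P = 6, blksz = 2, confirming that the load imbalance is at most blksz:

    /* List the iterations each processor receives under block-cyclic
       mapping with N = 26, P = 6, blksz = 2. */
    #include <stdio.h>
    #include <math.h>

    #define min(a, b) ((a) < (b) ? (a) : (b))

    int main(void)
    {
        int N = 26, P = 6, blksz = 2;
        int nLayers = (int)ceil(((float)N) / (blksz * P));

        for (int mypid = 0; mypid < P; mypid++) {
            printf("processor %d:", mypid);
            for (int layer = 0; layer < nLayers; layer++) {
                int beginBlk = layer * blksz * P;
                for (int i = beginBlk + mypid * blksz;
                     i < min(N, beginBlk + (mypid + 1) * blksz); i++)
                    printf(" %d", i);
            }
            printf("\n");
        }
        return 0;
    }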

12e.23 Block vs Cyclic Block-Cyclic is in between Block and Cyclic in terms of granularity, communication, and load balancing. Block and Cyclic are special cases of Block-Cyclic:
- Block = Block-Cyclic with blksz = ceiling(N/P)
- Cyclic = Block-Cyclic with blksz = 1
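The special cases can be checked directly by reusing the block-cyclic loop with the two extreme block sizes. The helper function below and the N = 26, P = 6 values are illustrative assumptions, not part of the slides:

    /* Show that the block-cyclic loop reproduces plain block mapping when
       blksz = ceil(N/P), and plain cyclic mapping when blksz = 1. */
    #include <stdio.h>
    #include <math.h>

    #define min(a, b) ((a) < (b) ? (a) : (b))

    static void print_mapping(const char *label, int N, int P, int blksz)
    {
        int nLayers = (int)ceil(((float)N) / (blksz * P));
        printf("%s (blksz = %d):\n", label, blksz);
        for (int mypid = 0; mypid < P; mypid++) {
            printf("  processor %d:", mypid);
            for (int layer = 0; layer < nLayers; layer++) {
                int beginBlk = layer * blksz * P;
                for (int i = beginBlk + mypid * blksz;
                     i < min(N, beginBlk + (mypid + 1) * blksz); i++)
                    printf(" %d", i);
            }
            printf("\n");
        }
    }

    int main(void)
    {
        int N = 26, P = 6;
        print_mapping("Block",  N, P, (int)ceil((float)N / P));  /* one layer */
        print_mapping("Cyclic", N, P, 1);                        /* blksz = 1 */
        return 0;
    }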