Lecture 6: Communication Complexity Analysis, Collective Operations, and Communication Analysis of Floyd's Algorithm

Lecture 6 Objectives
–Communication complexity analysis
–Collective operations
  –Reduction
  –Binomial trees
  –Gather and scatter operations
–Review communication analysis of Floyd's algorithm

Parallel Reduction Evolution

Binomial Trees
–A binomial tree is a subgraph of a hypercube

Finding Global Sum (sequence of figures)
A worked example spread over several slides: the values held by the processes are added pairwise in successive steps, with half of the remaining partial sums forwarded at each step, until one process holds the global sum. The communication pattern traced out by these steps is a binomial tree.
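
The same binomial-tree combination can be written directly with point-to-point messages. The sketch below is illustrative only (the local value and the pairing rule are assumptions); in practice a single MPI_Reduce call with MPI_SUM performs this reduction for you.

/* Binomial-tree global sum: after ceil(log2 p) steps, process 0 holds
   the sum of every process's local value.  Sketch only; a real program
   would normally call MPI_Reduce(..., MPI_SUM, 0, ...). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double sum = id + 1.0;            /* illustrative local value */

    for (int step = 1; step < p; step *= 2) {
        if (id % (2 * step) == step) {
            /* lower partner of this pair: send partial sum and drop out */
            MPI_Send(&sum, 1, MPI_DOUBLE, id - step, 0, MPI_COMM_WORLD);
            break;
        } else if (id % (2 * step) == 0 && id + step < p) {
            /* upper partner: receive and accumulate */
            double partner;
            MPI_Recv(&partner, 1, MPI_DOUBLE, id + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += partner;
        }
    }
    if (id == 0) printf("global sum = %f\n", sum);
    MPI_Finalize();
    return 0;
}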

Agglomeration (figure: tasks are agglomerated so that each process first sums its own values locally; the per-process partial sums are then combined along the binomial tree)

Gather
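
In MPI the gather pictured here is a single collective call. Below is a minimal sketch assuming a block decomposition of an n-element array of doubles; the helper name, the inline block-size arithmetic, and the choice of process 0 as root are illustrative, and MPI_Gatherv is used because the blocks may differ in size when n is not divisible by p.

/* Gather: the root (process 0) collects every process's block into one array. */
#include <mpi.h>
#include <stdlib.h>

void gather_blocks(double *local, int n, int id, int p, MPI_Comm comm)
{
    int *counts = NULL, *displs = NULL;
    double *all = NULL;

    if (id == 0) {
        counts = malloc(p * sizeof(int));
        displs = malloc(p * sizeof(int));
        for (int i = 0; i < p; i++) {
            int low  = i * n / p;           /* first index owned by process i  */
            int high = (i + 1) * n / p;     /* one past the last index it owns */
            counts[i] = high - low;
            displs[i] = low;
        }
        all = malloc(n * sizeof(double));
    }
    int my_count = (id + 1) * n / p - id * n / p;

    MPI_Gatherv(local, my_count, MPI_DOUBLE,
                all, counts, displs, MPI_DOUBLE, 0, comm);

    if (id == 0) {
        /* ... use the gathered array ... */
        free(all); free(counts); free(displs);
    }
}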

All-gather

Complete Graph for All-gather

Hypercube for All-gather
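
Whichever topology the underlying network provides, an MPI program expresses the operation with one call to MPI_Allgatherv. A minimal sketch, reusing the counts/displs arrays from the gather example above (names are illustrative):

/* All-gather: every process ends up with every block. */
#include <mpi.h>

void allgather_blocks(double *local, double *all, int *counts, int *displs,
                      int id, MPI_Comm comm)
{
    /* 'all' must be large enough for the full vector on every process;
       counts[i] and displs[i] give the size and offset of process i's block. */
    MPI_Allgatherv(local, counts[id], MPI_DOUBLE,
                   all, counts, displs, MPI_DOUBLE, comm);
}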

Analysis of Communication
–λ (latency): the message delay, i.e. the overhead of sending one message
–β (bandwidth): the number of data items that can be communicated per unit time
–Sending a message containing n data items therefore costs λ + n/β
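
As a worked illustration of this cost model, the helper below simply evaluates λ + n/β; the function name and the units convention are assumptions.

/* Time to send one message of n_items data items under the
   latency/bandwidth model: lambda + n_items / beta.
   Units must be consistent (e.g. seconds, items, items per second). */
double msg_time(double lambda, double beta, double n_items)
{
    return lambda + n_items / beta;
}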

Communication Time for All-Gather
–Hypercube: ⌈log p⌉ exchange steps, doubling the data held per process at each step:
  λ⌈log p⌉ + n(p − 1)/(βp)
–Complete graph: each process sends its n/p-item block directly to the other p − 1 processes:
  (p − 1)(λ + n/(βp))

Adding Data Input

Scatter

Scatter in log p Steps
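
In an MPI program the scatter is one collective call rather than an explicit log p-step loop. A minimal sketch (argument names are illustrative; counts and displs are significant only on the root, and local_count is the number of items this process receives):

/* Scatter: process 0 owns the whole array and sends block i to process i.
   counts/displs on the root are computed as in the gather example. */
#include <mpi.h>

void scatter_blocks(double *whole, int *counts, int *displs,
                    double *local, int local_count, MPI_Comm comm)
{
    /* 'whole', 'counts' and 'displs' are read only on the root (process 0). */
    MPI_Scatterv(whole, counts, displs, MPI_DOUBLE,
                 local, local_count, MPI_DOUBLE, 0, comm);
}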

Communication Time for Scatter
–Hypercube (scatter in ⌈log p⌉ steps, halving the data forwarded at each step):
  λ⌈log p⌉ + n(p − 1)/(βp)
–Complete graph (the root sends each of the p − 1 blocks in turn):
  (p − 1)(λ + n/(βp))

Recall Parallel Floyd's Computational Complexity
–Innermost loop has complexity Θ(n)
–Middle loop executed at most ⌈n/p⌉ times
–Outer loop executed n times
–Overall computational complexity: Θ(n³/p)

Floyd's Communication Complexity
–No communication in inner loop
–No communication in middle loop
–Broadcast in outer loop: complexity Θ(n log p) (a row of n elements reaches all p processes in ⌈log p⌉ message steps)
–Executed n times, so the total communication complexity is Θ(n² log p)
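
To make the "broadcast in the outer loop" concrete, here is a sketch of the outer loop of a rowwise block-striped parallel Floyd's algorithm in C with MPI. The owner computation mirrors Quinn's BLOCK_OWNER-style macro, but array names, allocation, and I/O are omitted or assumed.

/* Parallel Floyd's algorithm, rowwise block striping.
   The process that owns row k broadcasts it; every process then
   updates its own rows.  Sketch only: 'a' points to this process's
   locally stored rows, each of length n. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define MIN(a,b) ((a) < (b) ? (a) : (b))

void floyd(double **a, int n, int id, int p)
{
    int low  = id * n / p;            /* first row this process owns   */
    int high = (id + 1) * n / p;      /* one past the last row it owns */
    double *tmp = malloc(n * sizeof(double));

    for (int k = 0; k < n; k++) {
        int owner = (p * (k + 1) - 1) / n;   /* process owning row k */
        if (id == owner)
            memcpy(tmp, a[k - low], n * sizeof(double));
        MPI_Bcast(tmp, n, MPI_DOUBLE, owner, MPI_COMM_WORLD);
        for (int i = 0; i < high - low; i++)      /* middle loop: my rows */
            for (int j = 0; j < n; j++)           /* inner loop           */
                a[i][j] = MIN(a[i][j], a[i][k] + tmp[j]);
    }
    free(tmp);
}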

Execution Time Expression (1)
  n ⌈n/p⌉ n χ  +  n ⌈log p⌉ (λ + 4n/β)
–First term: iterations of outer loop × iterations of middle loop × iterations of inner loop × cell update time (χ)
–Second term: iterations of outer loop × messages per broadcast (⌈log p⌉) × message-passing time per message, where each message carries 4n bytes (4 bytes per element) and β is in bytes per second

Accounting for Computation/Communication Overlap
Note that after the first broadcast, the wait times overlap the computation time of process 0.

Execution Time Expression (2)
  n ⌈n/p⌉ n χ  +  n ⌈log p⌉ λ  +  ⌈log p⌉ 4n/β
–First term: iterations of outer loop × iterations of middle loop × iterations of inner loop × cell update time
–Second term: iterations of outer loop × messages per broadcast × message latency (the part of every broadcast that cannot be hidden)
–Third term: message transmission time for the broadcast steps that are not overlapped by computation

Predicted vs. Actual Performance
Execution time (sec) versus number of processes, predicted by the model and measured (the table of values appears in the original slide). Model parameters: χ = 25.5 nsec, λ = 250 μsec, β = 10 MB/sec, n = 1000.
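
For reference, the small program below evaluates the two execution-time expressions with the slide's parameters (χ = 25.5 ns, λ = 250 μs, β = 10 MB/s, n = 1000). It is a sketch: the expressions are the reconstructed ones above, and the predictions are only as good as that model.

/* Evaluate the two predicted-execution-time expressions for parallel
   Floyd's algorithm using the parameters from the slide.
   Compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double chi = 25.5e-9, lambda = 250e-6, beta = 10e6;
    const int n = 1000;

    for (int p = 1; p <= 8; p++) {
        double rows  = ceil((double)n / p);        /* ceil(n/p)   */
        double steps = ceil(log2((double)p));      /* ceil(log p) */
        double comp  = n * rows * n * chi;
        double t1 = comp + n * steps * (lambda + 4.0 * n / beta);      /* expr (1) */
        double t2 = comp + n * steps * lambda + steps * 4.0 * n / beta; /* expr (2) */
        printf("p=%d  predicted(1)=%.2f s  predicted(2)=%.2f s\n", p, t1, t2);
    }
    return 0;
}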

Summary
–Two matrix decompositions
  –Rowwise block striped
  –Columnwise block striped
–Blocking send/receive functions
  –MPI_Send
  –MPI_Recv
–Overlapping communication with computation