CS 584

Algorithm Analysis Assumptions
- Consider ring, mesh, and hypercube topologies.
- Each process can either send or receive a single message at a time.
- No special communication hardware.
- When discussing a mesh architecture we will consider a square toroidal mesh.
- The startup latency is t_s and the per-word transfer time is t_w, so sending an m-word message over one link costs t_s + t_w*m.

Basic Algorithms
- Broadcast algorithms
  - one to all (scatter)
  - all to one (gather)
  - all to all
- Reduction
  - all to one
  - all to all

Broadcast (ring)
- Distribute a message of size m from the source to all nodes.
- Start the message both ways around the ring.

T = (t_s + t_w*m)(p/2)
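The ring cost formula can be written as a small Python sketch (not from the slides; the function name and the ceil for odd p are my additions):

```python
import math

def ring_broadcast_time(p, m, t_s, t_w):
    """Time to broadcast an m-word message on a p-node ring,
    sending both ways from the source: ceil(p/2) hops,
    each hop costing t_s + t_w*m."""
    return (t_s + t_w * m) * math.ceil(p / 2)

# e.g. 8 nodes, a 1024-word message, t_s = 10, t_w = 1:
# 4 hops of cost 1034 each.
```

For even p this matches the slide's (t_s + t_w*m)(p/2) exactly; the ceil only matters when p is odd.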

Broadcast (mesh)
- Broadcast to the source row using the ring algorithm.
- Broadcast to the rest using the ring algorithm down each column, starting from the source row.

T = 2(t_s + t_w*m)(sqrt(p)/2)
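The two-phase mesh cost is just the ring cost applied twice, once along a row and once down the columns. A Python sketch of the formula (my own function name; it assumes the square toroidal mesh from the assumptions slide):

```python
import math

def mesh_broadcast_time(p, m, t_s, t_w):
    """Square sqrt(p) x sqrt(p) toroidal mesh: a ring broadcast
    along the source row, then ring broadcasts down every column,
    each phase covering a ring of sqrt(p) nodes."""
    side = math.isqrt(p)
    assert side * side == p, "assumes a square mesh"
    return 2 * (t_s + t_w * m) * (side // 2)
```

Note that for p = 16 this is 2(t_s + t_w*m)(4/2), already much better than a 16-node ring.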

Broadcast (hypercube)
- A message is sent along each dimension of the hypercube in turn.
- The set of informed nodes doubles each step, growing as a binary tree.

T = (t_s + t_w*m) log2(p)
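The doubling behavior is easy to verify with a small simulation (a Python sketch, not from the slides): at step i, every node that already has the message forwards it across dimension i, i.e. to the node whose label differs in bit i.

```python
def hypercube_broadcast(d, source=0):
    """Simulate one-to-all broadcast on a 2^d-node hypercube.
    At step i every informed node sends across dimension i
    (partner = node XOR 2^i), so the informed set doubles."""
    have = {source}
    steps = 0
    for i in range(d):
        have |= {node ^ (1 << i) for node in have}
        steps += 1
    return have, steps

# On a 3-cube, all 8 nodes are reached in log2(8) = 3 steps.
```

Each of the log2(p) steps costs one message time, giving the T = (t_s + t_w*m) log2(p) bound above.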

Broadcast
- The mesh algorithm was based on embedding rings in the mesh.
- Can we do better on the mesh?
- Can we embed a tree in a mesh?
  - Exercise for the reader. (-: hint, hint ;-)

Other Broadcasts
- Many algorithms for all-to-one and all-to-all communication are simply reversals and duals of the one-to-all broadcast.
- Examples:
  - All-to-one: reverse the algorithm and concatenate.
  - All-to-all: butterfly and concatenate.

Reduction Algorithms
- Reduce or combine a set of values on each processor to a single set.
  - Summation
  - Max/Min
- Many reduction algorithms simply use the all-to-one broadcast algorithm.
  - The reduction operation is performed at each node.

Reduction
- If the goal is to have only one processor with the answer, use the broadcast algorithms in reverse.
- If all must know, use the butterfly.
  - Reduces the algorithm from 2 log p to log p steps (versus a reduction followed by a broadcast).
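The butterfly (all-reduce) pattern can be sketched in a few lines of Python (my own illustrative simulation, assuming a power-of-two node count): at step i every node exchanges its partial sum with the node differing in bit i, so after log2(p) steps every node holds the full result.

```python
def butterfly_allreduce(values):
    """All-reduce (sum) on p = 2^d nodes via a butterfly exchange.
    At step i node k pairs with node k XOR 2^i and both add the
    other's partial sum; every node finishes with the total."""
    p = len(values)
    d = p.bit_length() - 1
    assert 1 << d == p, "assumes a power-of-two node count"
    vals = list(values)
    for i in range(d):
        # all pairwise exchanges of step i happen simultaneously
        vals = [vals[node] + vals[node ^ (1 << i)] for node in range(p)]
    return vals
```

This is log2(p) steps, versus the 2 log2(p) of an all-to-one reduction followed by a one-to-all broadcast.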

How'd they do that?
- Broadcast and reduction algorithms are based on Gray code numbering of nodes.
- Consider a hypercube: neighboring nodes differ in only one bit position.

How'd they do that?
- Start with the most significant bit.
- Flip the bit and send to that processor.
- Proceed with the next most significant bit.
- Continue until all bits have been used.
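The steps above determine each node's sequence of partners. A small Python sketch (the function name is mine, not from the slides):

```python
def broadcast_partners(d, my_id):
    """Partners of node my_id in a d-dimensional hypercube when bits
    are flipped from most significant to least significant, as the
    slide describes: partner at step i is my_id XOR 2^(d-1-i)."""
    return [my_id ^ (1 << i) for i in reversed(range(d))]

# In a 3-cube, node 0 contacts 4 (100), then 2 (010), then 1 (001).
```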

Procedure SingleNodeAccum(d, my_id, m, X, sum)
    for j = 0 to m-1
        sum[j] = X[j]
    mask = 0
    for i = 0 to d-1
        if ((my_id AND mask) == 0)
            if ((my_id AND 2^i) <> 0)
                msg_dest = my_id XOR 2^i
                send(sum, msg_dest)
            else
                msg_src = my_id XOR 2^i
                recv(X, msg_src)        // receive partner's partial sum into X
                for j = 0 to m-1
                    sum[j] = sum[j] + X[j]
            endif
        endif
        mask = mask XOR 2^i
    endfor
end
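To check the accumulation pattern, here is a sequential Python simulation of all 2^d nodes running this procedure (my own sketch; the send/recv pairs of each step are applied in place, which is safe because senders and receivers are disjoint within a step):

```python
def single_node_accum(d, X):
    """Sequentially simulate SingleNodeAccum on 2^d nodes.
    X[node] is each node's local m-element vector; the element-wise
    sum over all nodes accumulates at node 0."""
    p = 1 << d
    m = len(X[0])
    sums = {node: list(X[node]) for node in range(p)}
    mask = 0
    for i in range(d):
        for node in range(p):
            # node participates if its low i bits are clear and bit i is set:
            # it sends its partial sum to the partner with bit i clear
            if (node & mask) == 0 and (node & (1 << i)) != 0:
                dest = node ^ (1 << i)
                for j in range(m):
                    sums[dest][j] += sums[node][j]
        mask |= 1 << i
    return sums[0]

# e.g. d = 2 with local values [1], [2], [3], [4] accumulates [10] at node 0.
```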