
2016/1/6 Part I: A Taste of Parallel Algorithms

We examine five simple building-block parallel operations and look at the corresponding algorithms on four simple parallel architectures: linear array, binary tree, 2D mesh, and a simple shared-variable computer.

Semigroup Computation

Parallel Prefix Computation

Packet Routing A packet of information resides at Processor i and must be sent to Processor j. The problem is to route the packet through intermediate processors, if needed, so that it reaches its destination as quickly as possible. The problem becomes more challenging when multiple packets reside at different processors, each with its own destination. When each processor has at most one packet to send and one packet to receive, the packet routing problem is called one-to-one communication or 1-1 routing.

Broadcasting Given a value a known at a certain processor i, disseminate it to all p processors as quickly as possible, so that at the end every processor has access to, or "knows," the value. This is sometimes referred to as one-to-all communication. Sending the value to only a subset of the processors, i.e., one-to-many communication, is known as multicasting.

Sorting Rather than sorting a set of records, each with a key and data elements, we focus on sorting a set of keys for simplicity.

Linear Array: D = p - 1, d = 2. Ring?

Binary Tree: If all leaf levels are identical and every nonleaf processor has two children, the binary tree is said to be complete. D = ?, d = 3.

2D Mesh: D = ?, d = 4. Torus?

Shared Memory: A shared-memory multiprocessor can be modeled as a complete graph, in which every node is connected to every other node. D = 1, d = p - 1.

Algorithms for a Linear Array (1) Semigroup Computation: Let us first consider a special case of semigroup computation, namely maximum finding. Each of the p processors initially holds a value, and our goal is for every processor to learn the largest of these values.
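The scheme can be illustrated with a small sequential simulation (the function name and the step-by-step array model are mine, not from the slides): in every synchronous step, each processor exchanges its current value with both neighbors and keeps the largest value seen so far, so after p - 1 steps the maximum has reached every processor.

```python
def linear_array_max(values):
    """Simulate max-finding on a p-processor linear array.

    cur[i] models the value held by processor i; one loop iteration
    models one synchronous communication step.  After p - 1 steps,
    every processor holds the global maximum.
    """
    p = len(values)
    cur = list(values)
    for _ in range(p - 1):
        nxt = cur[:]
        for i in range(p):
            if i > 0:
                nxt[i] = max(nxt[i], cur[i - 1])   # value from left neighbor
            if i < p - 1:
                nxt[i] = max(nxt[i], cur[i + 1])   # value from right neighbor
        cur = nxt
    return cur  # every entry now equals max(values)
```

Note that p - 1 steps are necessary in the worst case, since that is the diameter of the linear array.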

Algorithms for a Linear Array (2) Parallel Prefix Computation (Case 1: one value per processor)

Algorithms for a Linear Array (3) Parallel Prefix Computation (Case 2: more than one value per processor)
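Both prefix cases can be sketched sequentially (function names are mine, and + stands in for any associative operator). In Case 1 the running prefix simply travels left to right, taking p - 1 steps. In Case 2, with several values per processor, each processor first computes a local prefix over its own block, the block totals are then prefix-combined across the array, and finally each processor adds the (diminished) prefix of everything to its left into its local results.

```python
from operator import add

def prefix_linear_array(values, op=add):
    """Case 1: one value per processor; the running result travels
    left to right in p - 1 communication steps."""
    out, acc = [], None
    for x in values:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

def prefix_blocks(blocks, op=add):
    """Case 2: each processor holds a block of values.
    (1) local prefix within each block,
    (2) Case-1 prefix over the block totals,
    (3) each processor combines the carry from its left with its
        local prefixes."""
    local = [prefix_linear_array(b, op) for b in blocks]
    totals = prefix_linear_array([b[-1] for b in local], op)
    result = [local[0]]
    for i in range(1, len(blocks)):
        carry = totals[i - 1]          # diminished prefix: rows to the left
        result.append([op(carry, v) for v in local[i]])
    return result
```

With p processors holding n/p values each, the local phases take O(n/p) time and the array phase O(p) steps.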

Algorithms for a Linear Array (4) Packet Routing
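On a linear array the routing rule is trivial and can be sketched as follows (the function is a hypothetical illustration): each intermediate processor forwards the packet one hop toward its destination, so delivery takes exactly |i - j| steps.

```python
def route_linear_array(src, dst):
    """Route a packet on a linear array: at each step the packet moves
    one hop toward its destination, taking |dst - src| steps total."""
    path, cur = [src], src
    while cur != dst:
        cur += 1 if dst > cur else -1   # forward right or left
        path.append(cur)
    return path
```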

Algorithms for a Linear Array (5) Broadcasting: If Processor i wants to broadcast a value a to all processors, it sends an rbcast(a) (read r-broadcast) message to its right neighbor and an lbcast(a) message to its left neighbor.
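A small simulation makes the cost visible (the function name is mine): each receiver forwards the message in the same direction, so the two wavefronts advance one hop per step and broadcasting finishes in max(i, p - 1 - i) steps.

```python
def linear_array_broadcast(p, i):
    """Simulate rbcast/lbcast wavefronts from processor i on a
    p-processor linear array; return the number of steps until
    all processors know the value."""
    left, right, steps = i, i, 0
    while left > 0 or right < p - 1:
        left = max(0, left - 1)          # lbcast advances one hop left
        right = min(p - 1, right + 1)    # rbcast advances one hop right
        steps += 1
    return steps
```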

Algorithms for a Linear Array (6) Sorting (Case 1)

Algorithms for a Linear Array (7) Sorting (Case 2: odd-even transposition; efficiency?)
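Odd-even transposition sort can be sketched as a sequential simulation: in even-numbered rounds processors (0,1), (2,3), ... compare-exchange with their right neighbors, in odd-numbered rounds processors (1,2), (3,4), ... do the same, and after p rounds the keys are sorted. (This is the standard algorithm; the round-by-round list model is mine.)

```python
def odd_even_transposition_sort(a):
    """Simulate odd-even transposition sort on a p-processor linear
    array: p alternating rounds of compare-exchange with neighbors."""
    a, p = list(a), len(a)
    for step in range(p):
        start = step % 2                 # even round: pairs (0,1),(2,3),...
        for i in range(start, p - 1, 2):  # odd round: pairs (1,2),(3,4),...
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

As for efficiency: p rounds for p keys means O(p) time, which is optimal for this diameter-(p-1) topology but far from the O(p log p) total work of sequential sorting spread over p processors.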

Algorithms for a Binary Tree (1) In algorithms for a binary tree of processors, we will assume that the data elements are initially held by the leaf processors only. The nonleaf (inner) processors participate in the computation, but do not hold data elements of their own.

Algorithms for a Binary Tree (2) Semigroup Computation: Each inner node receives two values from its children, applies the operator to them, and passes the result upward to its parent.
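The upward sweep can be sketched level by level (the level-list model is mine; a leaf without a sibling is simply carried up unchanged): each pass combines pairs of child values, so the reduction finishes in a number of steps proportional to the tree height, i.e., O(log p).

```python
def tree_reduce(leaves, op):
    """Simulate a semigroup computation on a binary tree of processors:
    each inner node applies op to the two values received from its
    children and sends the result up; one loop pass per tree level."""
    level = list(leaves)
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:               # unpaired value moves up as-is
            nxt.append(level[-1])
        level = nxt
    return level[0]                      # result held at the root
```

If every processor must know the result, the root then broadcasts it back down the tree, doubling the step count but keeping it O(log p).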

Algorithms for a Binary Tree (3) Parallel Prefix Computation

Algorithms for a Binary Tree (4) Packet Routing: The routing algorithm depends on the processor numbering scheme used; here we assume preorder numbering.
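With preorder numbering, a node can decide the next hop locally, assuming it knows its own subtree size and its left subtree size (this decision rule is a sketch of mine, not taken verbatim from the slides): node v's subtree occupies the preorder range [v, v + subtree_size), its left child is numbered v + 1, and its right child v + 1 + left_size.

```python
def next_hop(v, subtree_size, left_size, dst):
    """Local routing decision at node v under preorder numbering."""
    if dst == v:
        return "here"
    if v < dst < v + 1 + left_size:                  # in left subtree range
        return "left"                                 # left child is v + 1
    if v + 1 + left_size <= dst < v + subtree_size:  # in right subtree range
        return "right"                                # right child is v + 1 + left_size
    return "up"                                       # destination outside subtree
```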

Algorithms for a Binary Tree (5) Broadcasting: Processor i sends the desired data upward to the root processor, which then broadcasts the data downward to all processors.

Algorithms for a Binary Tree (6) Sorting

Algorithms for 2D Mesh (1) In all of the 2D mesh algorithms presented in this section, we use the linear-array algorithms of Section 2.3 as building blocks. This leads to simple algorithms, but not necessarily the most efficient ones. Mesh-based architectures and their algorithms will be discussed in great detail in Part III.

Algorithms for 2D Mesh (2) Semigroup Computation: For example, in finding the maximum of a set of p values, stored one per processor, the row maximums are computed first and made available to every processor in the row. Then column maximums are identified.
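The two phases can be sketched directly (the grid-of-lists model is mine; each phase stands for a run of the linear-array algorithm): after the row phase every processor in a row holds its row maximum, and after the column phase every processor holds the global maximum.

```python
def mesh_max(grid):
    """Simulate semigroup (max) computation on a 2D mesh:
    row phase, then column phase, each a linear-array max."""
    after_rows = [[max(row)] * len(row) for row in grid]   # row maxima everywhere
    cols = list(zip(*after_rows))                          # column phase input
    col_max = [max(col) for col in cols]                   # global max per column
    return [list(col_max) for _ in grid]                   # every entry = global max
```

On a sqrt(p) x sqrt(p) mesh each phase takes O(sqrt(p)) steps, matching the mesh diameter.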

Algorithms for 2D Mesh (3) Parallel Prefix Computation: (1) do a parallel prefix computation on each row; (2) do a diminished parallel prefix computation in the rightmost column; (3) broadcast the results in the rightmost column to all of the elements in the respective rows and combine with the initially computed row prefix values.
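The three phases above can be sketched as follows, with + standing in for any associative operator (the function name and sequential model are mine). "Diminished" means the prefix at each row excludes that row's own total, which is exactly the carry each row needs from the rows above it.

```python
from itertools import accumulate
from operator import add

def mesh_prefix(grid, op=add):
    """Simulate row-major parallel prefix on an r x c mesh."""
    # Phase 1: prefix within each row.
    rowpref = [list(accumulate(row, op)) for row in grid]
    # Phase 2: diminished prefix of the row totals (rightmost column);
    # carry for row i combines the totals of rows 0..i-1 only.
    carry, colcarry = None, []
    for row in rowpref:
        colcarry.append(carry)
        carry = row[-1] if carry is None else op(carry, row[-1])
    # Phase 3: broadcast each carry along its row and combine.
    out = []
    for c, row in zip(colcarry, rowpref):
        out.append(row if c is None else [op(c, v) for v in row])
    return out
```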

Algorithms for 2D Mesh (4) Packet Routing: To route a data packet from the processor in Row r, Column c, to the processor in Row r', Column c', we first route it within Row r to Column c'. Then, we route it in Column c' from Row r to Row r'. (row-first routing)
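Row-first routing can be sketched as a short simulation (the function is a hypothetical illustration): correct the column coordinate first, then the row, giving a path of length |r - r'| + |c - c'|.

```python
def route_mesh(src, dst):
    """Row-first routing on a 2D mesh: move along the source row to
    the destination column, then along that column to the destination
    row."""
    (r, c), (r2, c2) = src, dst
    path = [(r, c)]
    while c != c2:                       # row phase
        c += 1 if c2 > c else -1
        path.append((r, c))
    while r != r2:                       # column phase
        r += 1 if r2 > r else -1
        path.append((r, c))
    return path
```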

Algorithms for 2D Mesh (5) Broadcasting: (1) broadcast the packet to every processor in the source node's row, and (2) broadcast in all columns.

Algorithms for 2D Mesh (6) Sorting

Algorithms for Shared Variables: Semigroup Computation; Parallel Prefix Computation; Packet Routing (trivial, in view of the direct communication path between any pair of processors); Broadcasting (trivial, as each processor can send a data item to all processors directly); Sorting