Distributed-Memory or Graph Models


The sea of interconnection networks

Some Interconnection Networks

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Network name(s)            Number of nodes     Diameter    Bisection width   Node degree
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
1D mesh (linear array)     k                   k - 1       1                 2
1D torus (ring, loop)      k                   k/2         2                 2
2D mesh                    k^2                 2k - 2      k                 4
2D torus (k-ary 2-cube)    k^2                 k           2k                4
3D mesh                    k^3                 3k - 3      k^2               6
3D torus (k-ary 3-cube)    k^3                 3k/2        2k^2              6
Pyramid                    (4k^2 - 1)/3        2 log2 k    2k                9
Binary tree                2^l - 1             2l - 2      1                 3
4-ary hypertree            2^l (2^(l+1) - 1)   2l          2^(l+1)           6
Butterfly                  2^l (l + 1)         2l          2^l               4
Hypercube                  2^l                 l           2^(l-1)           l
Cube-connected cycles      2^l · l             2l          2^(l-1)           3
Shuffle-exchange           2^l                 2l - 1      >= 2^(l-1)/l      4 (unidir.)
De Bruijn                  2^l                 l           2^l / l           4 (unidir.)
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
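
As a quick way to read the table, here is a small Python sketch (the helper names are mine, not from the source) that plugs concrete sizes into a few of the formulas:

```python
def hypercube(l):
    """Hypercube with 2^l nodes: diameter l, bisection 2^(l-1), node degree l."""
    return {"nodes": 2 ** l, "diameter": l, "bisection": 2 ** (l - 1), "degree": l}

def mesh_2d(k):
    """k x k 2D mesh: diameter 2k - 2, bisection k, node degree 4."""
    return {"nodes": k ** 2, "diameter": 2 * k - 2, "bisection": k, "degree": 4}

def torus_2d(k):
    """k x k 2D torus (k-ary 2-cube): diameter k, bisection 2k, node degree 4."""
    return {"nodes": k ** 2, "diameter": k, "bisection": 2 * k, "degree": 4}

print(hypercube(10))  # 1024 nodes, diameter 10, bisection 512, degree 10
print(mesh_2d(32))    # 1024 nodes, diameter 62, bisection 32, degree 4
print(torus_2d(32))   # 1024 nodes, diameter 32, bisection 64, degree 4
```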

The Five Building-Block Computations

Semigroup computation (aka tree or fan-in computation): all processors get the computation result at the end
Parallel prefix computation: the ith processor holds the ith prefix result at the end
Packet routing: send a packet from a source processor to a destination processor
Broadcasting: send a packet from a source processor to all processors
Sorting: arrange a set of keys, stored one per processor, so that the ith processor holds the ith key in ascending order
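
To make the five operations concrete, here is a minimal sequential sketch of what each one is required to compute; it is a reference specification, not a parallel algorithm, and the operator and data are illustrative assumptions:

```python
from functools import reduce

def semigroup(values, op):
    """Semigroup computation: every processor ends up with op over all values."""
    return reduce(op, values)

def parallel_prefix(values, op):
    """Prefix computation: processor i ends up with op over values[0..i]."""
    out, acc = [], None
    for x in values:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

data = [3, 8, 6, 5, 10, 1]
print(semigroup(data, max))                       # 10
print(parallel_prefix(data, lambda a, b: a + b))  # [3, 11, 17, 22, 32, 33]
print(sorted(data))                               # the target arrangement for sorting
# Packet routing and broadcasting move data rather than compute: one packet
# travels from a source to a destination, or is delivered to all processors.
```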

Line & Ring Architecture

Max node degree d = 2
Network diameter D = p – 1 (ring: ⌊p/2⌋)
Bisection width B = 1 (ring: 2)

Fig. A linear array of nine processors and its ring variant.

Two-Dimensional (2D) Mesh

Nonsquare mesh (r rows, p/r columns) also possible
Max node degree d = 4
Network diameter D = 2√p – 2 (torus: √p)
Bisection width B = √p (torus: 2√p)

Fig. 2D mesh of 9 processors and its torus variant.

3D Mesh

(Balanced) Binary Tree Architecture

Complete binary tree: 2^q – 1 nodes, 2^(q–1) leaves
Balanced binary tree: leaf levels differ by at most 1
Max node degree d = 3
Network diameter D = 2 ⌈log2 p⌉ (or one less)
Bisection width B = 1

Fig. A balanced (but incomplete) binary tree of nine processors.

Fat-Tree (HyperTree)

Pyramid

Distributed-Shared-Memory Architecture (Clique / Complete Graph)

Max node degree d = p – 1
Network diameter D = 1
Bisection width B = ⌊p/2⌋ ⌈p/2⌉
Costly to implement and not scalable, but conceptually simple and easy to program.

Fig. A shared-variable architecture modeled as a complete graph.

Global-Shared-Memory Architecture (PRAM)

Hypercube

Cube-Connected-Cycles (CCC)

Shuffle-Exchange

De Bruijn

Butterfly Network

Architecture/Algorithm Combinations

The five operations (semigroup computation, parallel prefix, packet routing, broadcasting, sorting) are examined on each architecture. We will spend more time on the linear array and the binary tree and less time on the mesh and shared memory.

Algorithms for a Linear Array

For general semigroup computation:
Phase 1: the partial result is propagated from left to right
Phase 2: the result obtained by processor p – 1 is broadcast leftward

Fig. Maximum-finding on a linear array of nine processors.
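
A minimal Python sketch of the two phases, simulating the parallel steps sequentially (the data and operator are illustrative assumptions):

```python
def linear_array_semigroup(values, op):
    """Simulate semigroup computation on a linear array of p processors.

    Phase 1: partial results flow left to right (p - 1 communication steps).
    Phase 2: processor p-1 broadcasts the final result leftward (p - 1 steps).
    """
    p = len(values)
    partial = list(values)
    # Phase 1: each processor combines its left neighbor's partial result with its own value.
    for i in range(1, p):
        partial[i] = op(partial[i - 1], partial[i])
    # Phase 2: the result held by processor p-1 travels back toward processor 0.
    result = partial[p - 1]
    return [result] * p   # every processor ends up holding the result

print(linear_array_semigroup([5, 2, 8, 1, 9, 3, 7, 4, 6], max))   # nine copies of 9
```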

Linear Array Prefix Sum Computation

Diminished parallel prefix computation: the ith processor obtains the result over elements 0 through i – 1

Fig. Computing prefix sums on a linear array of nine processors.
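
The figure's idea in code: a sequential simulation of both the ordinary and the diminished prefix on a linear array (operator, identity, and data are illustrative assumptions):

```python
def linear_array_prefix(values, op):
    """Inclusive parallel prefix on a linear array.

    Processor i receives the prefix of processors 0..i-1 from its left neighbor,
    combines it with its own value, and passes the new prefix to the right.
    After p - 1 steps, processor i holds values[0] op ... op values[i].
    """
    prefix = list(values)
    for i in range(1, len(values)):
        prefix[i] = op(prefix[i - 1], prefix[i])
    return prefix

def linear_array_diminished_prefix(values, op, identity):
    """Diminished variant: processor i holds the combination of values[0..i-1]."""
    result = [identity]
    for i in range(1, len(values)):
        result.append(op(result[i - 1], values[i - 1]))
    return result

add = lambda a, b: a + b
print(linear_array_prefix([1, 2, 3, 4, 5], add))                 # [1, 3, 6, 10, 15]
print(linear_array_diminished_prefix([1, 2, 3, 4, 5], add, 0))   # [0, 1, 3, 6, 10]
```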

Linear Array Routing and Broadcasting

To route from processor i to processor j: compute j – i to determine the distance and direction
To broadcast from processor i: send a left-moving and a right-moving broadcast message

Fig. Routing and broadcasting on a linear array of nine processors.
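
A sketch of the per-hop routing decision, assuming processors are numbered 0 to p – 1 from left to right (the indices in the example are illustrative):

```python
def route_step(current, destination):
    """One routing step on a linear array: move the packet one hop toward its destination.

    The sign of (destination - current) gives the direction and its magnitude the
    remaining distance, so routing takes |destination - current| steps in total.
    """
    if destination > current:
        return current + 1   # forward the packet to the right neighbor
    if destination < current:
        return current - 1   # forward the packet to the left neighbor
    return current           # packet has arrived

# Example: route a packet from processor 2 to processor 6.
node, hops = 2, 0
while node != 6:
    node = route_step(node, 6)
    hops += 1
print(hops)   # 4 = |6 - 2| steps
```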

Linear-Array Prefix Sum Computation (Two Items per Processor)

Fig. 2.8 Computing prefix sums on a linear array with two items per processor.
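
A sketch of the two-items-per-processor scheme: each processor first combines its own pair, a diminished prefix over the local totals runs on the array, and each processor then applies the incoming offset locally (the function name, operator, and data are my own illustrative choices):

```python
def prefix_two_items_per_processor(pairs, op, identity):
    """Prefix computation when each of the p processors holds two items.

    Step 1: each processor combines its two items locally.
    Step 2: a diminished prefix over the p local totals runs on the linear array.
    Step 3: each processor applies the offset it receives to its local prefixes.
    """
    # Step 1: local prefixes and local totals
    local_prefix = [(a, op(a, b)) for a, b in pairs]
    totals = [lp[1] for lp in local_prefix]
    # Step 2: diminished prefix of the totals (the offset arriving at each processor)
    offsets = [identity]
    for i in range(1, len(totals)):
        offsets.append(op(offsets[i - 1], totals[i - 1]))
    # Step 3: combine the offsets with the local prefixes
    return [(op(off, a), op(off, ab)) for off, (a, ab) in zip(offsets, local_prefix)]

print(prefix_two_items_per_processor([(1, 2), (3, 4), (5, 6)], lambda a, b: a + b, 0))
# [(1, 3), (6, 10), (15, 21)]
```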

Linear Array Sorting (Externally Supplied Keys)

Fig. Sorting on a linear array with the keys input sequentially from the left.
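
A functional (not cycle-accurate) sketch of this pipelined sort, assuming the usual rule that each processor keeps the smaller of its held key and the arriving key and forwards the larger to the right:

```python
def pipeline_sort(keys, p):
    """Sorting on a linear array with keys fed in from the left (a sketch).

    Each processor holds one key.  When a new key arrives, the processor keeps
    the smaller of the two and forwards the larger, so after all keys have been
    injected and have drained through, processor i holds the i-th smallest key.
    """
    held = [None] * p               # one register per processor
    for k in keys:                  # keys enter one at a time from the left
        moving = k
        for i in range(p):          # the injected key ripples to the right
            if held[i] is None:
                held[i] = moving    # empty processor absorbs the key
                break
            if moving < held[i]:    # keep the smaller, forward the larger
                held[i], moving = moving, held[i]
    return held

print(pipeline_sort([5, 2, 8, 1, 9], 5))   # [1, 2, 5, 8, 9]
```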

Linear Array Sorting (Internally Stored Keys)

T(1) = W(1) = p log2 p
T(p) = p
W(p) ≈ p^2/2
S(p) = log2 p
R(p) = p/(2 log2 p)

Fig. Odd-even transposition sort on a linear array.
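
A sequential simulation of odd-even transposition sort; each inner loop pass represents compare-exchanges that all happen in the same parallel step:

```python
def odd_even_transposition_sort(keys):
    """Odd-even transposition sort on a linear array (sequential simulation).

    In even-numbered steps, processor pairs (0,1), (2,3), ... compare-exchange;
    in odd-numbered steps, pairs (1,2), (3,4), ... do.  After p such steps the
    keys are sorted, giving T(p) = p and W(p) ~ p^2/2.
    """
    a = list(keys)
    p = len(a)
    for step in range(p):
        start = 0 if step % 2 == 0 else 1
        for i in range(start, p - 1, 2):      # these pairs all work in parallel
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([6, 3, 8, 1, 9, 2]))   # [1, 2, 3, 6, 8, 9]
```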

Algorithms for a 2D Mesh

Fig. Finding the max value on a 2D mesh.
Fig. Computing prefix sums on a 2D mesh.
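
Both mesh algorithms use a row-then-column decomposition: each row runs the linear-array algorithm, then a single column combines the row results. A minimal sketch for the semigroup (max) case, with illustrative data:

```python
from functools import reduce

def mesh_semigroup(grid, op):
    """Semigroup computation on a 2D mesh: rows first, then one column.

    Phase 1: each row runs the linear-array semigroup algorithm (rows in parallel).
    Phase 2: the column holding the row results runs the same algorithm.
    Phase 3: the final result is broadcast back along that column and the rows.
    """
    row_results = [reduce(op, row) for row in grid]       # phase 1
    result = reduce(op, row_results)                      # phase 2
    return [[result] * len(row) for row in grid]          # phase 3 (broadcast back)

grid = [[3, 7, 2],
        [9, 1, 5],
        [4, 8, 6]]
print(mesh_semigroup(grid, max)[0][0])   # 9
```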

Routing and Broadcasting on a 2D Mesh

Nonsquare mesh (r rows, p/r columns) also possible
Routing: send along the row to the correct column, then route within the column
Broadcasting: broadcast within the source row, then broadcast within all columns

Fig. Routing and broadcasting on a nine-processor 2D mesh or torus.
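
A sketch of the row-first (dimension-ordered) routing rule; coordinates are (row, column) pairs and the example endpoints are illustrative:

```python
def mesh_route(src, dst):
    """Row-first routing on a 2D mesh.

    The packet first moves along its row to the destination column, then along
    that column to the destination row; src and dst are (row, col) coordinates.
    """
    path = [src]
    r, c = src
    while c != dst[1]:                       # phase 1: correct the column
        c += 1 if dst[1] > c else -1
        path.append((r, c))
    while r != dst[0]:                       # phase 2: correct the row
        r += 1 if dst[0] > r else -1
        path.append((r, c))
    return path

print(mesh_route((0, 0), (2, 1)))   # [(0, 0), (0, 1), (1, 1), (2, 1)]
```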

Sorting on a 2D Mesh Using Shearsort

Number of iterations = log2 √p
Compare-exchange steps in each iteration = 2√p
Total steps = (log2 p + 1) √p

Fig. The shearsort algorithm on a 3 × 3 mesh.
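
A sequential sketch of shearsort's phases; for brevity each row and column sort uses a library sort here, whereas on the mesh each one would itself be an odd-even transposition sort:

```python
import math

def shearsort(grid):
    """Shearsort on an r x r mesh (sequential sketch of the parallel phases).

    Each iteration sorts the rows in snake order (even rows ascending, odd rows
    descending) and then sorts the columns top-down; a final row phase leaves
    the mesh sorted in snake (boustrophedon) order.
    """
    r = len(grid)
    a = [row[:] for row in grid]

    def row_phase():
        for i in range(r):                      # all rows in parallel on the mesh
            a[i].sort(reverse=(i % 2 == 1))     # snake order

    def column_phase():
        for j in range(r):                      # all columns in parallel on the mesh
            col = sorted(a[i][j] for i in range(r))
            for i in range(r):
                a[i][j] = col[i]

    for _ in range(math.ceil(math.log2(r))):
        row_phase()
        column_phase()
    row_phase()                                 # final row phase
    return a

print(shearsort([[8, 3, 6], [1, 9, 4], [7, 2, 5]]))
# [[1, 2, 3], [6, 5, 4], [7, 8, 9]]  (sorted in snake order)
```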

Algorithms for a Binary Tree

Fig. Semigroup computation and broadcasting on a binary tree.
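
A sketch of the fan-in (upward) phase followed by a downward broadcast; the level-by-level pairing stands in for the tree's sibling links, and the data are illustrative:

```python
def tree_semigroup(leaves, op):
    """Semigroup computation on a binary tree of processors (a sketch).

    Values are combined level by level toward the root (fan-in); the root then
    broadcasts the result back down, so the whole computation takes O(log p)
    steps instead of the linear array's O(p).
    """
    level = list(leaves)
    while len(level) > 1:                       # upward (fan-in) phase
        nxt = []
        for i in range(0, len(level) - 1, 2):   # sibling pairs combine in parallel
            nxt.append(op(level[i], level[i + 1]))
        if len(level) % 2 == 1:                 # an unpaired node passes its value up unchanged
            nxt.append(level[-1])
        level = nxt
    result = level[0]
    return [result] * len(leaves)               # downward phase: broadcast to all leaves

print(tree_semigroup([5, 2, 8, 1, 9, 3, 7, 4, 6], max))   # nine copies of 9
```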

Binary Tree Packet Routing

Two indexing schemes: preorder indexing, or a numbering in which the node index is a representation of the path from the tree root

Fig. Packet routing on a binary tree with two indexing schemes.
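
A sketch of routing under the path-from-root numbering (root = 1, children of node i are 2i and 2i + 1, so the bits of an index spell the left/right path from the root); the packet climbs to the lowest common ancestor and then descends. Preorder indexing would instead compare against subtree index ranges. The node numbers in the example are illustrative:

```python
def tree_route(src, dst):
    """Packet routing on a binary tree using path-from-root node indices."""
    up, down = [], []
    a, b = src, dst
    while a != b:                    # climb from the deeper node until the paths meet
        if a > b:
            up.append(a)
            a //= 2                  # move to the parent
        else:
            down.append(b)
            b //= 2
    return up + [a] + list(reversed(down))   # ascend to the common ancestor, then descend

print(tree_route(9, 14))   # [9, 4, 2, 1, 3, 7, 14]
```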

Binary Tree Parallel Prefix Computation

Upward propagation phase, followed by downward propagation

Fig. Parallel prefix computation on a binary tree of processors.
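
A recursive sketch of the two phases: the upward pass computes each subtree's combined value, and the downward pass delivers to every leaf the combination of everything to its left (operator, identity, and data are illustrative assumptions):

```python
def tree_parallel_prefix(leaves, op, identity):
    """Parallel prefix on a binary tree of processors (recursive sketch).

    Upward phase: an inner node computes the combination of its subtree's leaves.
    Downward phase: a node receives the combination of everything to the left of
    its subtree; its left child inherits that value, and its right child receives
    it combined with the left subtree's upward value.
    """
    def upward(lo, hi):
        if hi - lo == 1:
            return leaves[lo]
        mid = (lo + hi) // 2
        return op(upward(lo, mid), upward(mid, hi))

    def downward(lo, hi, left_of_subtree, out):
        if hi - lo == 1:
            out[lo] = op(left_of_subtree, leaves[lo])
            return
        mid = (lo + hi) // 2
        downward(lo, mid, left_of_subtree, out)
        downward(mid, hi, op(left_of_subtree, upward(lo, mid)), out)

    out = [None] * len(leaves)
    downward(0, len(leaves), identity, out)
    return out

print(tree_parallel_prefix([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b, 0))
# [1, 3, 6, 10, 15, 21, 28, 36]
```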

Node Function in Binary Tree Parallel Prefix

Each node performs two binary operations: one during the upward propagation phase and another during downward propagation (insert latches for systolic operation).
Upward propagation: a node combines its children's results [i, j – 1] and [j, k] into [i, k].
Downward propagation: a node receiving the prefix [0, i – 1] passes it to its left child and passes [0, j – 1] = [0, i – 1] combined with [i, j – 1] to its right child.

2.6 Algorithms with Shared Variables

Semigroup computation: each processor can perform the computation locally
Parallel prefix computation: same as semigroup, except only data from smaller-index processors are combined
Packet routing: trivial
Broadcasting: one step with all-port communication (p – 1 steps with single-port)
Sorting: each processor determines the rank of its data element, followed by routing

Fig. Nine processors holding the keys 12, 3, 8, 6, 5, 10, 1, 15, 4: Rank[15] = 8 (8 others smaller); Rank[4] = 2 (only 1 and 3 are smaller).
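
A sketch of the rank-then-route idea for the shared-variable sort, using the keys from the figure; the tie-breaking by processor index is my own addition to keep ranks distinct for equal keys:

```python
def shared_memory_rank_sort(keys):
    """Rank-based sorting with shared variables (a sketch).

    Conceptually, processor i reads every other key from shared memory, counts
    how many are smaller (its element's rank), and then writes its key to
    position rank in the output array - a single routing step.
    """
    ranks = []
    for i, k in enumerate(keys):                 # each processor works in parallel
        smaller = sum(1 for other in keys if other < k)
        equal_before = sum(1 for other in keys[:i] if other == k)   # tie-break by index
        ranks.append(smaller + equal_before)
    out = [None] * len(keys)
    for k, r in zip(keys, ranks):                # "routing": write each key to its rank
        out[r] = k
    return out

print(shared_memory_rank_sort([12, 3, 8, 6, 5, 10, 1, 15, 4]))
# [1, 3, 4, 5, 6, 8, 10, 12, 15]
```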