HYPERCUBE ALGORITHMS-1


HYPERCUBE ALGORITHMS-1 3/12/2013 Computer Engg, IIT(BHU)

Computation Model Hypercubes of 0, 1, 2 and 3 dimensions.

Computation model Each node of a d-dimensional hypercube is numbered using d bits; hence there are 2^d processors in a d-dimensional hypercube. Two nodes are connected by a direct link if and only if their numbers differ in exactly one bit.
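As a quick illustration (a minimal Python sketch, not part of the original slides): two labels are adjacent exactly when their bitwise XOR is a power of two.

    def is_neighbor(u, v):
        # u and v are hypercube neighbors iff their labels differ in
        # exactly one bit, i.e. u XOR v is a nonzero power of two.
        x = u ^ v
        return x != 0 and (x & (x - 1)) == 0

    # Example in a 3-cube: 010 and 011 are neighbors; 010 and 001 are not.
    assert is_neighbor(0b010, 0b011)
    assert not is_neighbor(0b010, 0b001)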

Computation Model The diameter of a d-dimensional hypercube is d, since at most d bits must be flipped (at most d links traversed) to reach one processor from another. The bisection width of a d-dimensional hypercube is 2^(d-1).

Computation Model The hypercube is a highly scalable architecture: two d-dimensional hypercubes can easily be combined to form a (d+1)-dimensional hypercube. There are two communication variants of the hypercube: in the sequential hypercube, a processor can communicate with only one neighbour at a time; in the parallel hypercube, it can communicate with all of its neighbours simultaneously. The hypercube also has several derivative networks, such as the butterfly, the shuffle-exchange network and cube-connected cycles.

The Butterfly Network Algorithms for the hypercube can be adapted to the butterfly network and vice versa. A d-dimensional butterfly (Bd) has (d+1)2^d processors and d*2^(d+1) links. A processor is represented as a tuple <r, l>, where r is its row (a d-bit string) and l its level (0 <= l <= d). Each processor u = <r, l> is connected to two processors at level l+1: v = <r, l+1> and w = <r(l+1), l+1>, where r(l+1) denotes r with its (l+1)-st bit complemented. (u, v) is called a direct link and (u, w) a cross link.
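For concreteness, a hedged Python sketch of the two forward links of a butterfly processor; the bit-numbering convention (which bit of r the cross link at level l flips) is an assumption here, since texts differ on it.

    def butterfly_links(r, l, d):
        # Forward links of processor <r, l> in B_d, for 0 <= l < d.
        # Direct link: same row, next level. Cross link: complement the
        # (l+1)-st bit of r, counted from 1 at the most significant bit
        # (an assumed convention).
        direct = (r, l + 1)
        cross = (r ^ (1 << (d - 1 - l)), l + 1)
        return direct, cross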

The Butterfly Network There exists a unique path of length d from any processor u at level 0 to any processor v at level d, called the greedy path; hence the diameter of the butterfly network is 2d. When each row of Bd is collapsed into a single processor, preserving all the links, the resulting graph is the hypercube Hd.
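The greedy path can be computed by bit fixing: at each level, take the cross link exactly when the corresponding bit of the current row disagrees with the destination row. A sketch, under the same bit-numbering assumption as above:

    def greedy_path(r, r_dest, d):
        # Unique greedy path from <r, 0> to <r_dest, d> in B_d:
        # at level l, cross iff bit (l+1) of the row (from the MSB)
        # must change to match r_dest.
        path = [(r, 0)]
        for l in range(d):
            bit = 1 << (d - 1 - l)
            if (r ^ r_dest) & bit:
                r ^= bit              # take the cross link
            path.append((r, l + 1))
        return path

    # greedy_path(0b000, 0b101, 3) visits rows 000, 100, 100, 101
    # at levels 0, 1, 2, 3.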

The Butterfly Network Each step of Bd can be simulated in one step on the parallel version of Hd. A butterfly algorithm is called normal if, at any given time, only the processors of a single level participate. A single step of any normal algorithm can be simulated in a single step on the sequential Hd.

Embedding A general mapping of one network G(V1, E1) into another H(V2, E2) is called an embedding. Embedding of a ring: if 0, 1, 2, ..., 2^d - 1 are the processors of a ring, processor 0 is mapped to processor 000...0 of the hypercube, and the full mapping is obtained using Gray codes.
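The binary-reflected Gray code gives this mapping directly. A minimal sketch; the closed form i XOR (i >> 1) is the standard Gray-code construction, not something specific to these slides:

    def ring_to_hypercube(i):
        # Ring position i maps to the Gray code of i. Consecutive ring
        # positions differ in exactly one bit, so every ring link maps
        # onto a hypercube link.
        return i ^ (i >> 1)

    # d = 3: ring 0..7 -> nodes 000, 001, 011, 010, 110, 111, 101, 100.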

Embedding Embedding of a binary tree: a p-leaf (p = 2^d) complete binary tree T can be embedded into Hd. Since T has more processors than Hd, more than one processor of T has to be mapped onto the same processor of Hd. If the leaves of T are 0, 1, 2, ..., p-1, then leaf i is mapped to the ith processor of Hd, and each internal processor of T is mapped to the same processor of Hd as its leftmost descendant leaf.
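A hedged sketch of the stated mapping, assuming tree nodes are addressed by (depth, left-to-right index), an addressing scheme introduced here for illustration; the leftmost descendant leaf then has a closed form:

    def tree_to_hypercube(depth, index, d):
        # Node `index` (left to right) at the given depth of a complete
        # binary tree with 2^d leaves (leaves at depth d) maps to the
        # hypercube processor holding its leftmost descendant leaf.
        return index << (d - depth)

    # d = 3: the root (0, 0) and leaf 0 both map to processor 000;
    # internal node (1, 1) maps to processor 100 (= leaf 4).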

PPR Routing: A Greedy Algorithm In Bd, each packet originates at level 0 and is destined for level d. The greedy algorithm routes each packet along the greedy path between its origin and its destination, so the distance travelled by any packet is exactly d. The algorithm runs in O(2^(d/2)) time, with an average queue length of O(2^(d/2)).

Fundamental Algorithms: Broadcasting A hypercube with 2^d nodes can be regarded as a d-dimensional mesh with two nodes in each dimension, so the mesh broadcast algorithm generalizes to the hypercube, where the operation is carried out in d (= log p) steps.
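A sketch of the d-step one-to-all broadcast from node 0, written in SPMD style; send and recv stand in for the machine's single-port communication primitives and are assumptions, not a particular library's API:

    def one_to_all_broadcast(my_id, d, msg, send, recv):
        # Round i splits the cube across dimension i. A node is active
        # in round i iff its lower i bits are zero; active nodes with
        # bit i clear already hold the message and forward it, those
        # with bit i set receive it for the first time.
        for i in range(d - 1, -1, -1):
            mask = 1 << i
            if my_id & (mask - 1) == 0:        # active this round
                partner = my_id ^ mask
                if my_id & mask == 0:
                    send(msg, partner)         # forward along dimension i
                else:
                    msg = recv(partner)        # first receipt of the message
        return msg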

Broadcasting One-to-all broadcast on a three-dimensional hypercube. The binary representations of node labels are shown in parentheses.

Broadcasting In the binary tree embedding, each node makes two copies of the message and sends one to its left child and the other to its right child.

Prefix sum computation Computing prefix sums on an eight-node hypercube. At each node, square brackets show the local prefix sum accumulated in the result buffer and parentheses enclose the contents of the outgoing message buffer for the next step.
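The figure's algorithm as a hedged SPMD sketch: each node keeps a result buffer and an outgoing message buffer, exchanging across one dimension per step (send and recv are again placeholder primitives, not from the slides):

    def prefix_sum(my_id, d, x, send, recv):
        # result accumulates x over all nodes numbered <= my_id;
        # msg accumulates x over the node's current sub-cube.
        result = x
        msg = x
        for i in range(d):
            partner = my_id ^ (1 << i)
            send(msg, partner)
            incoming = recv(partner)
            msg = msg + incoming
            if partner < my_id:        # partner's sub-cube precedes ours
                result = result + incoming
        return result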

Prefix Computation The binary tree embedding can be used, with two phases: a forward phase and a reverse phase. Forward phase: the leaves start by sending their data up to their parents. Each internal processor, on receipt of two items (y from its left child and z from its right child), computes w = y + z, stores a copy of y, and sends w to its parent. At the end of d steps, each processor in the tree has stored in its memory the sum of all data items in the subtree rooted at it; in particular, the root holds the sum of all elements in the tree.

Prefix sum computation Reverse phase: the root starts by sending zero to its left child and its stored y to its right child. Each internal processor, on receipt of a datum q from its parent, sends q to its left child and q + y to its right child. When the ith leaf receives a datum q from its parent, it computes q + xi (where xi is its own data item) and stores it as the final result.
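The two phases can be checked with a short sequential simulation (a sketch, assuming 2^d leaf values; the per-level lists play the role of the tree processors):

    def tree_prefix_sums(x):
        # x holds the leaf values; len(x) must be a power of two.
        # Forward phase: each internal node remembers its left child's
        # sum y and passes w = y + z up to its parent.
        sums = x[:]
        lefts = []                     # y values remembered per level
        while len(sums) > 1:
            y = sums[0::2]             # left-child subtree sums
            z = sums[1::2]             # right-child subtree sums
            lefts.append(y)
            sums = [a + b for a, b in zip(y, z)]
        # Reverse phase: each parent sends q left and q + y right.
        q = [0]                        # the root receives zero
        for y in reversed(lefts):
            nxt = []
            for qi, yi in zip(q, y):
                nxt += [qi, qi + yi]
            q = nxt
        # Leaf i computes q + x_i as its prefix sum.
        return [qi + xi for qi, xi in zip(q, x)]

    # tree_prefix_sums([1, 2, 3, 4]) -> [1, 3, 6, 10]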

Data Concentration Assume that k < p items are distributed arbitrarily among the p processors. The problem is to move the data into processors 0, 1, 2, ..., k-1 of Hd. The algorithm has two phases: 1. A prefix sums operation is performed to compute the destination address of each data item. 2. Each packet is routed to its destination along the greedy path from its origin.

Data Concentration In the second phase, when the packets are routed along their greedy paths, the claim is that no two packets ever meet on a path, so there is no contention. Data concentration can be performed on Bd, as well as on the sequential Hd, in O(d) time.
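Phase 1 amounts to an exclusive prefix sum over occupancy flags; a sequential sketch (has_item and the None placeholder are illustrative names, not from the slides):

    def concentration_destinations(has_item):
        # has_item[j] is 1 if processor j holds an item, else 0. The
        # exclusive prefix sum of these flags gives each item's
        # destination processor.
        dest, count = [], 0
        for flag in has_item:
            dest.append(count if flag else None)
            count += flag
        return dest

    # has_item = [0, 1, 1, 0, 1] -> destinations [None, 0, 1, None, 2]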

Selection Given a sequence of n keys, the problem is to find the ith smallest key in the sequence. There are two different versions: a) n = p and b) n > p. The work-optimal algorithm for the mesh can be adapted to run optimally on Hd as well; selection from n = p keys can be performed in O(d) time on Hd.

Selection Step 3: each processor identifies the number of remaining keys in its queue, and a prefix sums computation is performed over these counts; this takes O(n/(p log p) + d) time. Step 4: if i > rm, all blocks to the left are eliminated; otherwise the blocks to the right are eliminated. Step 5: a prefix computation is done to find each surviving key's destination, followed by a broadcast; this takes O(d) time. Selection on Hd can be performed in O((n/p) log log p + d^2 log n) time.