Heiko Schröder, 1998 ROUTING ? Sorting? Image Processing? Sparse Matrices? Reconfigurable Meshes !



Reconfigurable architectures: FPGAs, reconfigurable multibus, reconfigurable networks (Transputers, PVM), the dynamically reconfigurable mesh. Aim: efficiency; special purpose --> general purpose architectures.

Contents:
1.) Motivation for the reconfigurable mesh
2.) Routing (and sorting): better than PRAM, better than mesh
3.) Image processing
4.) Sparse matrix multiplication
5.) Bounded bus length

PRAM: diameter O(1), bisection width Θ(n) (any cut). Variants: EREW, CRCW.

2D mesh / torus: diameter Θ(√n), bisection width Θ(√n).

Hypercube (0-D, 1-D, 2-D, 3-D, 4-D): diameter O(log n), bisection width Θ(n).

Reconfigurable mesh = mesh + interior connections. Each processor can connect its four ports in any of 15 positions; diameter 1 !! Low cost.

Global OR: time O(1) on RM -- Ω(log n) on EREW-PRAM.
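The O(1) bus trick can be sketched in plain Python. This is a toy model of my own, not the deck's circuit; the function name and bus encoding are assumptions:

```python
def global_or_on_row_bus(bits):
    """Toy model of global OR on a 1-D reconfigurable bus.

    Every PE holding a 0 fuses its left and right ports into one bus
    segment; every PE holding a 1 breaks the bus and drives a signal
    onto the segment towards PE 0.  PE 0 reads its segment: it sees
    the signal iff at least one PE holds a 1 -- one bus cycle, O(1).
    """
    for b in bits:
        if b == 1:          # this PE breaks the bus and drives a 1
            return 1        # ...which reaches PE 0's segment
    return 0                # bus stays quiet: all inputs were 0
```

On an EREW-PRAM the same reduction needs a Ω(log n) combining tree, which is exactly the gap the slide points out.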

Prefix sum: fast but expensive. Time: O(1), area: Θ(n×n).

Modulo-3 counter: time O(1) on RM -- Ω(log n / log log n) on CRCW-PRAM.

Modulo-k² counter (ranking): two-digit numbers in base k represent all numbers smaller than k². 1.) determine x mod k (= least significant digit); 2.) count the number of “wraps” (= most significant digit). Modulo-k counting --> modulo-k² counting in 2 steps on a k × k² array.
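The two-digit idea is ordinary base-k arithmetic. A minimal sketch (the flat `sum` is a shortcut for what the RM computes with buses; the function name is mine):

```python
def mod_k2_count(bits, k):
    """Count 1-bits modulo k**2 via two base-k digits.

    Step 1 (lsd): the count modulo k, from the modulo-k counter.
    Step 2 (msd): how many times the count "wrapped" past k, mod k.
    The two digits combine as msd * k + lsd == total % k**2.
    """
    total = sum(bits)          # computed distributedly on the RM
    lsd = total % k            # 1.) x mod k
    msd = (total // k) % k     # 2.) number of wraps, modulo k
    return msd * k + lsd
```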

Enumeration / prefix sum: time O(log n). Wire efficiency! Compared with a tree: half the number of processors.

Permutation routing on an n × n RM: 2 steps !!!

Kunde’s all-to-all mapping. Sorting: sort blocks; all-to-all (columns); sort blocks; all-to-all (rows); odd-even-sort blocks.

Sorting in constant time. Complete sort: sort blocks; all-to-all (2); sort blocks; all-to-all (2); odd-even-sort blocks; block broadcast (1). Sort blocks: broadcast (1), rank (2).

Heiko Schröder, 1998 Reconfigurable mesh 16 P A R C better than PRAM --- but useless!!!

Kunde’s all-to-all mapping on an n × n array.

Heiko Schröder, 1998 Reconfigurable mesh 18 P A R C vertical all-to-all

Heiko Schröder, 1998 Reconfigurable mesh 19 P A R C horizontal all-to-all

Use of the bus: 1 step, 2 steps, 3 steps, …, k/2 steps, …, 3 steps, 2 steps, 1 step -- (k/2)² steps in total.

Sorting in optimal time (Kunde / Schröder). All-to-all: (k/2)² steps; with k = n^(1/3), each step takes n^(1/3) time --> T = n/4 × 2 = n/2 per all-to-all. Sorting: sort blocks (O(n^(2/3))); all-to-all (n/2); sort blocks (O(n^(2/3))); all-to-all (n/2); sort blocks (O(n^(2/3))). Time: n + o(n).
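The arithmetic behind the n + o(n) claim can be checked directly. This is a cost model only, under assumptions read off the slide (k = n^(1/3), the ×2 factor, three local block sorts):

```python
def sorting_time(n):
    """Cost model for the optimal-time sort.

    Assumes k = n**(1/3), two all-to-all phases of (k/2)**2 steps,
    each step taking n**(1/3) time, times 2 (as on the slide), plus
    three local block sorts of O(n**(2/3)).
    """
    k = n ** (1 / 3)                        # block side length
    per_all_to_all = (k / 2) ** 2 * k * 2   # = n/4 * 2 = n/2
    block_sorts = 3 * n ** (2 / 3)          # the o(n) term
    return 2 * per_all_to_all + block_sorts # = n + o(n)
```

For n = 10^6 this gives 1 030 000 steps, i.e. n plus a 3% lower-order term.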

Why optimal? A sorter for n keys whose data has a bisection of k wires needs sorting time > n/k.

Use of the theorem: 1.) n keys on a k×k RM: time ≥ n/k. Proof: wherever the data is stored, there is always a bisection of length k -- this can be shown by sweeping left to right through the array. Q.e.d. 2.) n×n keys on an n×n RM: time ≥ n. Proof: trivial.

Heiko Schröder, 1998 Reconfigurable mesh 24 P A R C n + o(n) Optimal --- but...

Enumeration / prefix sum: time O(log n). Wire efficiency! Compared with a tree: half the number of processors.

Move-and-smooth ABCD-routing: row-major enumeration of the A, B, C and D packets within each quadrant in time 4 log n; then determine the destination position of each packet.

Elementary steps: move, smooth, collect.

Time analysis (move, smooth, ABCD, collect): time 3 × n/2; overall T = 3n + o(n).

T < 2n:
4 destination squares -- time 3n + 4 log n
16 destination squares -- time 2n + 16 log n
64 destination squares -- time 12/7 n + 64 log n
(mesh diameter: 2n)
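Plugging numbers into the three stated bounds shows when a finer destination split beats the 2n mesh diameter. The coefficients are taken from the slide; that the log term counts one enumeration per destination square is my reading:

```python
import math

def routing_time(n, squares, coeff):
    """Stated ABCD-routing time: coeff * n + squares * log n."""
    return coeff * n + squares * math.log2(n)

n = 2 ** 20
for squares, coeff in [(4, 3), (16, 2), (64, 12 / 7)]:
    t = routing_time(n, squares, coeff)
    print(squares, round(t), t < 2 * n)   # only 64 squares drops below 2n here
```

The log term grows with the number of squares, so the split cannot be refined indefinitely; 12/7 n + 64 log n is the best of the three for any realistic n.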

Enough of routing/sorting: constant factors! Can we do better? What kind of problems? Image processing; sparse problems!

Image processing: border following, edge detection, component labelling, skeletons, transforms.

Component labelling. For each object, define the border (candidates) and set up a bus along it. While its own label has not been received: 1.) candidates break the bus and send their label a) clockwise and b) anti-clockwise; 2.) candidates switch off and restore the bus if they see a smaller label. Time: O(1) -- O(log n).
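The elimination rounds can be simulated with a toy ring model. This sketch is my own, and assumes distinct candidate labels and synchronous rounds:

```python
def label_component(border_labels):
    """Toy rounds of the ring-bus labelling scheme.

    Each round, every still-active candidate sends its label both
    ways along the ring; a candidate that sees a smaller label from
    either side switches off and restores the bus.  With distinct
    labels, only local minima survive a round, so at most half the
    candidates remain: the minimum wins within O(log n) rounds.
    """
    active = list(border_labels)
    while len(active) > 1:
        m = len(active)
        active = [lab for i, lab in enumerate(active)
                  if lab <= active[(i - 1) % m]      # label from one side
                  and lab <= active[(i + 1) % m]]    # ...and the other
    return active[0]   # the component's label: the minimum
```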

Transforms:
Wavelet transform: time log n on RM -- time n on mesh
FFT: time n on RM and mesh
Hough transform: time m × log n on RM -- time m × n on mesh

Systolic matrix multiplication (C = A × B): time n.
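A time-n systolic schedule is achieved, for example, by Cannon's algorithm on an n × n torus; the sketch below is that standard algorithm, not necessarily the exact scheme drawn on the slide:

```python
def cannon_matmul(A, B):
    """Cannon-style systolic multiply on an n x n torus.

    After pre-skewing (row i of A rotated left by i, column j of B
    rotated up by j), every cell multiply-accumulates its current
    pair and passes A one step left, B one step up: n compute steps
    for n x n matrices, matching the slide's time-n bound.
    """
    n = len(A)
    # Pre-skew so that cell (i, j) starts with A[i][(i+j) % n], B[(i+j) % n][j].
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[0] * n for _ in range(n)]
    for _ in range(n):                      # n systolic steps
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]
        # Shift A one step left, B one step up (torus wraparound).
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return C
```

At step t, cell (i, j) holds A[i][k] and B[k][j] for k = (i + j + t) mod n, so over the n steps every product of the inner-product sum is formed exactly once.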

Sparse matrix multiplication A × B = C (time n on an n×n mesh):
A and B column-sparse (k²)
A and B row-sparse (k²)
A row-sparse, B column-sparse (k²)
A column-sparse, B row-sparse

Unlimited bus length: ring broadcast.

A row-sparse, B column-sparse:
Repeat k times
Begin
  horizontal ring broadcast
  Repeat k times: vertical ring broadcast
End.
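What the k × k nested ring broadcasts accomplish can be written down sequentially. The sparse-row/column encoding and the function name below are my own; the double loop mirrors the k² broadcast rounds:

```python
def sparse_matmul(A, B, k):
    """Sequential model of the k*k ring-broadcast rounds.

    A is row-sparse, B column-sparse: at most k non-zeros per row of
    A and per column of B.  In round (i, j), the i-th non-zero of
    each A-row meets the j-th non-zero of each B-column; a product
    is accumulated at cell (r, c) whenever the indices match.
    """
    n = len(A)
    row_nz = [[(j, A[i][j]) for j in range(n) if A[i][j]] for i in range(n)]
    col_nz = [[(i, B[i][j]) for i in range(n) if B[i][j]] for j in range(n)]
    assert all(len(r) <= k for r in row_nz)
    assert all(len(c) <= k for c in col_nz)
    C = [[0] * n for _ in range(n)]
    for i in range(k):              # horizontal ring broadcast round i
        for j in range(k):          # vertical ring broadcast round j
            for r in range(n):
                if i < len(row_nz[r]):
                    m, a = row_nz[r][i]
                    for c in range(n):
                        if j < len(col_nz[c]):
                            mb, b = col_nz[c][j]
                            if m == mb:         # a_{r,m} meets b_{m,c}
                                C[r][c] += a * b
    return C    # k*k rounds in total, matching the k^2 bound
```

Every (A-non-zero, B-non-zero) pair is considered exactly once, so k² rounds suffice regardless of n.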

Lower bound (figure: positions (c, r) in A, B, C; k = 3, n = 48).

Splitting the problem: A = A_s + A_r (the first s B-elements vs. the rest), so C = A_s·B + A_r·B.
Repeat k times
Begin
  vertical ring broadcast
  Repeat s times: horizontal ring broadcast
End. Time: ks.

A has nk non-zero elements --> A_r has at most nk/s non-zero rows --> for s = √n, A_r has at most k√n non-zero rows. A_s·B is a CC-problem --> it takes time k√n. (A = A_s + A_r.)

Calculating the products A_r·B: k A-elements meet k B-elements; time: k².

Column sum (row i, neighbours i-1, i+1): time log n. (The per-column element count and the routing time are given as formulas on the slide.)

Routing within columns. (The routing time is given as a formula on the slide.)

Reconfigurable architectures: a reconfigurable mesh with constant diameter? No!!! Physical laws!

Physical limits: c = 300 000 km/sec = 30 cm/ns; on chip: 1 cm/ns --> bounded bus length is a good idea!

Bounded broadcast: time k + n/l.

Creating main stations: time k.

A row-sparse, B column-sparse (bounded buses):
Create main stations 1,…,k for A and B (time: n/l + k)
For i = 1,…,k do
Begin
  horizontal ring broadcast i of A
  For j = 1,…,k do vertical ring broadcast j of B
End.

A and B column-sparse (bounded buses):
Create main stations 1,…,k for A (time: n/l + k)
For i = 1,…,k do
Begin
  horizontal ring broadcast i
  k bounded vertical broadcasts of products
  merging of new products
End.

Remove minor stations.

Results (time n on an n×n mesh):
A and B column-sparse: (k²) (k² + 2n/l)
A and B row-sparse: (k²) (k² + 2n/l)
A row-sparse, B column-sparse: (k²) (k² + n/l)
A column-sparse, B row-sparse: (+11n/l)
Image processing, sorting, routing, load balancing: better than the mesh!
(Kunde, Middendorf, Schmeck, Schröder, Turner) (Kapoor, Kunde, Kaufmann, Schroeder, Sibeyn)
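For concrete parameters the bounded-bus times stay far below the n-step dense mesh bound. The formulas come from the results slide; the example values of n, k and l are arbitrary choices of mine:

```python
def sparse_times(n, k, l):
    """Times from the results slide (sparsity k, bus length bound l)."""
    return {
        "mesh, dense systolic": n,
        "A and B column-sparse": k**2 + 2 * n / l,
        "A and B row-sparse": k**2 + 2 * n / l,
        "A row-sparse, B column-sparse": k**2 + n / l,
    }

for name, t in sparse_times(n=4096, k=16, l=64).items():
    print(f"{name:30s} {t:6.0f}")
```

With n = 4096, k = 16, l = 64 the sparse cases cost a few hundred steps against 4096 for the dense systolic mesh, which is the "better than the mesh" claim in numbers.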

The End.
The RM is in some cases “better” than the PRAM.
The RM is always at least as “good” as the mesh.
The RM is often “better” than the mesh.