CIS 270 - December '99 Introduction to Parallel Architectures Dr. Laurence Boxer Niagara University.


Parallel Computers
Purpose: speed. Divide a problem among processors and let each processor work on its portion of the problem in parallel (simultaneously) with the other processors.
Ideal: if p is the number of processors, we get the solution in 1/p of the time used by a 1-processor computer.
Actual: we rarely get that much speedup, due to delays for interprocessor communication.

Graphs of relevant functions (figure).

Architectural issues
Communication diameter: the number of communication steps needed to send data from the processor that has it to the processor that needs it. Large is bad: it limits speed.
Bisection width: the number of wires that must be cut to separate the network into two halves; a measure of how fast massive amounts of data can be moved through the network. Large is good: a small bisection width limits speed.
Degree of network: the number of communication links per processor; important to scalability (the ability to expand the number of processors). Large is bad: it limits expansion.
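
As a quick reference, the sketch below tabulates these three measures for the interconnection networks examined later in these slides (linear array, square mesh, hypercube). The formulas are the standard ones; the size n = 64 is illustrative, and mesh boundary processors have degree smaller than 4.

```c
#include <stdio.h>
#include <math.h>

/* Print diameter, bisection width, and (interior) degree for an
   n-processor linear array, square mesh, and hypercube. */
int main(void) {
    int n = 64;                          /* illustrative size: a power of 4 */
    int side = (int)sqrt((double)n);     /* side length of the square mesh */
    int d = (int)round(log2((double)n)); /* dimension of the hypercube */

    printf("%-13s %-10s %-10s %s\n", "network", "diameter", "bisection", "degree");
    printf("%-13s %-10d %-10d %d\n", "linear array", n - 1,          1,      2);
    printf("%-13s %-10d %-10d %d\n", "mesh",         2 * (side - 1), side,   4);
    printf("%-13s %-10d %-10d %d\n", "hypercube",    d,              n / 2,  d);
    return 0;
}
```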

PRAM - Parallel Random Access Machine
Shared memory yields fast communications: the source processor writes data to memory and the destination processor reads it, so any processor can send data to any other processor in Θ(1) time.
These fast communications make this model a theoretical ideal for the fastest possible parallel algorithms for a given number of processors.
Impractical: too many wires if there are lots of processors.


Notice the tree structure of the previous algorithm:
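
The slide's illustration is not reproduced in this transcript; below is a minimal sketch, assuming the algorithm referred to is a tree-structured (pairwise-combining) parallel sum with one value per simulated processor. The data values and processor count are illustrative.

```c
#include <stdio.h>

#define N 8   /* number of simulated processors; a power of 2 */

int main(void) {
    int val[N] = {3, 1, 4, 1, 5, 9, 2, 6};   /* one item per processor */

    /* One pass of the outer loop per level of the tree.  With stride s,
       processor i (a multiple of 2s) absorbs the partial sum held by
       processor i + s.  On a PRAM every addition within a level happens
       simultaneously, so n values are totaled in Theta(log n) time. */
    for (int s = 1; s < N; s *= 2)
        for (int i = 0; i + s < N; i += 2 * s)
            val[i] += val[i + s];

    printf("total = %d\n", val[0]);   /* 31 */
    return 0;
}
```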

Linear array architecture
Degree of network: 2 - easily expanded.
Bisection width: 1 - can't move large amounts of data efficiently across the network.
Communication diameter: n-1 - won't perform global communication operations efficiently.

Total on a linear array
Assume 1 item per processor. The communication diameter of n-1 implies the total requires Ω(n) time, and n-1 neighbor-to-neighbor steps suffice, so the time is Θ(n). Since this is the time required to total n items on a RAM, there is no asymptotic benefit to using a linear array for this problem.
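
A minimal sequential sketch of the data flow, assuming one item per simulated processor (the values are illustrative): each processor adds the partial sum received from its left neighbor and forwards the result to the right, so the total emerges at the right end after n-1 communication steps.

```c
#include <stdio.h>

#define N 8   /* simulated processors in a row, one item each */

int main(void) {
    int val[N] = {3, 1, 4, 1, 5, 9, 2, 6};

    /* In each communication step a processor adds the partial sum arriving
       from its left neighbor and forwards the result to the right.  After
       n-1 steps the rightmost processor holds the total: Theta(n) time. */
    for (int i = 1; i < N; i++)
        val[i] += val[i - 1];

    printf("total at rightmost processor = %d\n", val[N - 1]);   /* 31 */
    return 0;
}
```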

Input-based sorting on a linear array
The algorithm illustrated is a version of selection sort: each processor keeps the smallest value it sees and passes the others to the right. The time is proportional to the communication diameter, Θ(n).
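
The slide's illustration is not reproduced; the sketch below simulates the idea under the stated rule. Items enter at the left end one per step; a processor holding nothing keeps the arriving value, and a processor holding a larger value swaps it for the arriving one and passes the larger value rightward, so each processor ends up holding the smallest value it ever saw. The input data are illustrative.

```c
#include <stdio.h>

#define N 8

int main(void) {
    int input[N] = {5, 2, 7, 1, 8, 3, 6, 4};  /* arrives at processor 0, one item per step */
    int held[N];
    int have[N] = {0};                        /* does processor i hold a value yet? */

    /* Feed the items in one at a time.  A value moves right until it reaches
       a processor with no value, or displaces a larger held value (the larger
       one then continues rightward). */
    for (int t = 0; t < N; t++) {
        int moving = input[t];
        for (int p = 0; p < N; p++) {
            if (!have[p]) { held[p] = moving; have[p] = 1; break; }
            if (moving < held[p]) { int tmp = held[p]; held[p] = moving; moving = tmp; }
        }
    }

    for (int p = 0; p < N; p++) printf("%d ", held[p]);  /* 1 2 3 4 5 6 7 8 */
    printf("\n");
    return 0;
}
```

After all n items have been fed in and drained through, processor i holds the i-th smallest value; the number of communication steps is proportional to n.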

Mesh architecture
A square grid of processors (√n × √n, for n processors). Each processor is connected by a communication link to its N, S, E, and W neighbors. Degree of network: 4 - makes expansion easy: adjacent meshes can be introduced and their border processors connected.

Application: sorting on a mesh
The initial data could all lie in the "wrong half" of the mesh, as shown. In 1 time unit, the amount of data that can cross into the correct half of the mesh is at most the bisection width, Θ(√n). Since all n items must get to the correct half-mesh, the time required to sort such an input is Ω(n/√n) = Ω(√n).

Broadcast in a mesh
In a mesh, each of these steps takes Θ(√n) time. Hence, the time for a broadcast is Θ(√n).
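
The slides do not spell out the broadcast steps here, so the following is a hedged sketch of one standard way to achieve the Θ(√n) bound: send the value along the source's row, then down every column in parallel. The mesh size and the broadcast value are illustrative.

```c
#include <stdio.h>

#define SIDE 4   /* a SIDE x SIDE mesh, n = SIDE*SIDE processors */

int main(void) {
    int mesh[SIDE][SIDE] = {0};
    mesh[0][0] = 42;   /* the value to broadcast sits in the corner processor */

    /* Phase 1: the value travels rightward along row 0, one link per step. */
    for (int c = 1; c < SIDE; c++)
        mesh[0][c] = mesh[0][c - 1];

    /* Phase 2: every column forwards the value downward; the columns work
       in parallel, so this phase also takes only SIDE-1 steps. */
    for (int r = 1; r < SIDE; r++)
        for (int c = 0; c < SIDE; c++)
            mesh[r][c] = mesh[r - 1][c];

    /* Total: 2*(SIDE-1) = Theta(sqrt(n)) communication steps. */
    printf("processor (3,3) now holds %d\n", mesh[SIDE - 1][SIDE - 1]);   /* 42 */
    return 0;
}
```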

Semigroup operation (e.g., total) in a mesh
1. "Roll up" the columns in parallel, totaling each column into the last row by sending data downward.
2. Roll up the last row to get the total in a corner processor.
3. Broadcast the total from the corner to all processors.
Time: Θ(√n).
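
A minimal sequential sketch of steps 1 and 2, assuming one item per simulated processor (the 4 × 4 size and the data 1..16 are illustrative); the concluding broadcast is the operation sketched above, so every phase costs Θ(√n).

```c
#include <stdio.h>

#define SIDE 4   /* SIDE x SIDE mesh, one item per processor */

int main(void) {
    int mesh[SIDE][SIDE], total;

    for (int r = 0; r < SIDE; r++)            /* illustrative data: 1..16 */
        for (int c = 0; c < SIDE; c++)
            mesh[r][c] = r * SIDE + c + 1;

    /* Step 1: roll the columns up in parallel (simulated as a downward sweep);
       after SIDE-1 steps the last row holds the column totals. */
    for (int r = 1; r < SIDE; r++)
        for (int c = 0; c < SIDE; c++)
            mesh[r][c] += mesh[r - 1][c];

    /* Step 2: roll the last row into the corner processor. */
    for (int c = 1; c < SIDE; c++)
        mesh[SIDE - 1][c] += mesh[SIDE - 1][c - 1];
    total = mesh[SIDE - 1][SIDE - 1];

    /* Step 3 (broadcast from the corner) also takes Theta(sqrt(n)) steps. */
    printf("total = %d\n", total);   /* 136 */
    return 0;
}
```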

Mesh total algorithm - continued
The previous algorithm could run in approximately half the time by gathering the total in a center processor rather than a corner processor. However, the running time is still Θ(√n), i.e., still approximately proportional to the communication diameter (with a smaller constant of proportionality).

Hypercube
The number n of processors is a power of 2. Processors are numbered from 0 to n-1. Connected processors are those whose binary labels differ in exactly 1 bit, so each processor has log₂ n neighbors (degree log₂ n) and the communication diameter is log₂ n.

Illustration of the total operation in a hypercube (reverse the direction of the arrows to broadcast the result). Time: Θ(log n).
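
The figure is not reproduced here; below is a minimal sequential sketch of the total operation it depicts, assuming one value per simulated processor (the 8-processor cube and the data values are illustrative). In round d, each participating processor sends its partial sum across dimension d to the neighbor whose label differs only in bit d; after log₂ n rounds processor 0 holds the total, and reversing the sends broadcasts the result.

```c
#include <stdio.h>

#define D 3          /* dimension of the hypercube */
#define N (1 << D)   /* number of processors: 8 */

int main(void) {
    int val[N] = {3, 1, 4, 1, 5, 9, 2, 6};   /* one item per processor */

    /* One round per dimension.  In round d, a processor takes part if its
       d low-order bits are 0; those with bit d set send their partial sum
       across dimension d, and those with bit d clear receive and add.
       The rounds run one after another, so the time is Theta(log n). */
    for (int d = 0; d < D; d++) {
        for (int p = 0; p < N; p++) {
            int low_bits_clear = (p & ((1 << d) - 1)) == 0;
            if (low_bits_clear && ((p >> d) & 1))
                val[p ^ (1 << d)] += val[p];   /* send across dimension d */
        }
    }

    printf("total at processor 0 = %d\n", val[0]);   /* 31 */
    return 0;
}
```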

Coarse-grained parallelism
Most of the previous discussion concerned fine-grained parallelism, in which the number of processors is comparable to the number of data items. Realistically, few budgets accommodate such expensive computers; it is more likely that we use coarse-grained parallelism, with relatively few processors compared with the number of data items. Coarse-grained algorithms are often based on each processor boiling its share of the data down to a single partial result, then using a fine-grained algorithm to combine these partial results.

Example: coarse-grained total
Suppose the n data items are distributed evenly, n/p per processor, among the p processors.
1. In parallel, each processor totals its share of the data. Time: Θ(n/p).
2. Use a fine-grained algorithm to add the partial sums (the total residing in one processor) and broadcast the result to all processors. In the case of a mesh, time: Θ(√p).
Total time for a mesh: Θ(n/p + √p). Since n/p ≥ √p (i.e., n ≥ p^(3/2)), this is Θ(n/p) - optimal.
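
A minimal MPI sketch of this coarse-grained total (MPI is not mentioned on this slide, so treat it as an assumed implementation vehicle; the data generation and per-process item count are illustrative). Each process computes its local sum in Θ(n/p) time, and MPI_Allreduce performs the combine-and-broadcast of step 2.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, p;
    const int items_per_proc = 1000;   /* n/p, illustrative */
    double local_sum = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Step 1: each process totals its own share of the data: Theta(n/p). */
    for (int i = 0; i < items_per_proc; i++)
        local_sum += rank * items_per_proc + i;   /* stand-in for real data */

    /* Step 2: combine the p partial sums and broadcast the result; on a mesh
       this costs Theta(sqrt(p)), dominated by step 1 when n/p >= sqrt(p). */
    MPI_Allreduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}
```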

More info: Algorithms Sequential and Parallel, by Russ Miller and Laurence Boxer, Prentice-Hall, 2000 (available December 1999).