Models of Parallel Computation

Models of Parallel Computation

References:
- W+A: Appendix D
- Culler et al., "LogP: Towards a Realistic Model of Parallel Computation," PPoPP, May 1993.
- Alpern, B., L. Carter, and J. Ferrante, "Modeling Parallel Computers as Memory Hierarchies," in Programming Models for Massively Parallel Computers, W. K. Giloi, S. Jähnichen, and B. D. Shriver, eds., IEEE Press, 1993.

CSE 160 / Berman

Computation Models
A model provides the underlying abstraction used for analyzing costs and designing algorithms. Serial computational models use the RAM or the Turing machine (TM) as the underlying model for algorithm design.

RAM [Random Access Machine]
A RAM consists of:
- an unalterable program of optionally labeled instructions
- a memory composed of a sequence of words, each capable of holding an arbitrary integer
- an accumulator, referenced implicitly by most instructions
- a read-only input tape
- a write-only output tape

RAM Assumptions
- All instructions take the same time to execute.
- Word length is unbounded.
- The RAM has an arbitrary amount of memory.
- Any memory location can be accessed in the same amount of time.
The RAM provides an idealized model of a serial computer for analyzing the efficiency of serial algorithms.

PRAM [Parallel Random Access Machine]
The PRAM provides an idealized model of a parallel computer for analyzing the efficiency of parallel algorithms. A PRAM is composed of:
- P unmodifiable programs, each composed of optionally labeled instructions
- a single shared memory composed of a sequence of words, each capable of holding an arbitrary integer
- P accumulators, one associated with each program
- a read-only input tape
- a write-only output tape

More PRAM
The PRAM is a synchronous, MIMD, shared-memory parallel computer. Different protocols can be used for reading and writing shared memory:
- EREW (exclusive read, exclusive write)
- CREW (concurrent read, exclusive write)
- CRCW (concurrent read, concurrent write) -- requires an additional protocol for arbitrating write conflicts
A PRAM can emulate a message-passing machine by logically dividing shared memory into private memories for the P processors.

Broadcasting on a PRAM
A broadcast can be done on a CREW PRAM in O(1) time:
- The broadcaster writes the value to shared memory.
- All processors read the value from shared memory.
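
As a minimal illustration (a hypothetical Python sketch, not part of the original slides), the CREW broadcast is just one write followed by concurrent reads:

```python
# Sketch of a CREW PRAM broadcast: one write to a shared cell, then
# concurrent reads by all P processors. Concurrent reads are legal
# under CREW, so the whole operation is O(1) PRAM steps for any P.

P = 8
shared = [None]  # a single shared-memory cell

def broadcast(value):
    shared[0] = value                     # broadcaster writes
    return [shared[0] for _ in range(P)]  # all processors read concurrently

print(broadcast(42))  # every processor now holds 42
```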

LogP Machine Model
- A model of a distributed-memory multicomputer, developed by Culler, Karp, Patterson, et al.
- The authors tried to model the prevailing parallel architectures (circa 1993).
- The machine model represents the prevalent MPP organization of the time:
  - a machine constructed from at most a few thousand nodes
  - each node contains a powerful processor
  - each node contains substantial memory
  - the interconnection structure has limited bandwidth
  - the interconnection structure has significant latency

LogP Parameters
- L: an upper bound on the latency incurred by sending a message from a source to a destination
- o: overhead, defined as the time a processor is engaged in sending or receiving a message, during which it cannot do anything else
- g: gap, defined as the minimum time between consecutive message transmissions or receptions at a processor
- P: the number of processor/memory modules

LogP Assumptions
- The network has finite capacity: at most ⌈L/g⌉ messages can be in transit from any one processor to any other at one time.
- Communication is asynchronous: the latency and the order of messages are unpredictable.
- All messages are small.
- Context-switching overhead is 0 (not modeled).
- Multithreading (virtual processes) may be employed, but only up to a limit of L/g virtual processors.

LogP Notes
- All parameters are measured in processor cycles.
- Local operations take one cycle.
- Messages are assumed to be small.
- LogP was particularly well suited to modeling the CM-5; it is not clear whether the same correlation holds for other machines.

LogP Analysis of the PRAM Broadcasting Algorithm
- Broadcaster sends the value to shared memory (we'll assume the value starts in P0's memory).
- P processors read from shared memory (i.e., the other processors receive messages from P0).
- Time for P0 to send its messages: o + g(P-1).
- Maximum time for the other processors to receive their messages: o + (P-2)g + o + L + o (evaluated in the sketch below).
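
These formulas are easy to evaluate mechanically. The sketch below (hypothetical helper names, all quantities in processor cycles) follows the slide's accounting, with parameter values L=6, o=2, g=4 assumed for illustration:

```python
# Cost of the flat (PRAM-style) broadcast under LogP, following the
# slide's accounting. All quantities are in processor cycles.

def flat_broadcast_sender_busy(P, o, g):
    """Time P0 spends injecting its messages: o + g(P-1)."""
    return o + g * (P - 1)

def flat_broadcast_finish(P, o, g, L):
    """Time at which the last processor has received and processed
    its message: o + (P-2)g + o + L + o."""
    return o + (P - 2) * g + o + L + o

# Illustrative parameter values (an assumption, not from the slide):
print(flat_broadcast_sender_busy(P=8, o=2, g=4))   # 30
print(flat_broadcast_finish(P=8, o=2, g=4, L=6))   # 36
```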

Efficient Broadcasting in the LogP Model
The gap includes the overhead time, so overhead < gap.
[Figure: timeline for processors P0–P7, annotated with o, g, and L.]

Mapping Induced by the LogP Broadcasting Algorithm on 8 Processors
[Figure: broadcast tree rooted at P0 with receive times 10, 14, 18, 20, 22, and 24 at the other processors, alongside the corresponding timeline for P0–P7 annotated with o, g, and L.]

Analysis of the LogP Broadcasting Algorithm to 7 Processors
- Time for the first processor (P5) to receive a message from P0: L + 2o.
- Time for the last processor to receive its message: max{3g + L + 2o, 2g + L + 2o, g + 2L + 4o, 2L + 4o, g + 2L + 4o} = max{3g + L + 2o, g + 2L + 4o}. (The greedy schedule sketched below reproduces these times.)
- Compare to the LogP analysis of the PRAM broadcast, which is o + (P-2)g + o + L + o = 5g + 3o + L (with P = 7 receivers).
[Figure: timeline for P0–P7, annotated with o, g, and L.]
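
The receive times on the slide can be reproduced with a short greedy simulation: every processor that holds the value re-sends it every g cycles, and a message injected at time t is usable by its receiver at t + 2o + L. This is a sketch, assuming the classic LogP illustration parameters L=6, o=2, g=4, P=8, which yield exactly the times shown on the slide:

```python
import heapq

# Greedy LogP broadcast: every informed processor keeps injecting a
# message every g cycles; a message sent at time t is fully received
# (and forwardable) at t + o + L + o.

def logp_broadcast_times(P, L, o, g):
    senders = [0]              # times at which informed processors may send
    recv_times = []
    while len(recv_times) < P - 1:          # P-1 processors must be reached
        t = heapq.heappop(senders)          # earliest available sender
        arrival = t + 2 * o + L             # receiver holds the value now
        recv_times.append(arrival)
        heapq.heappush(senders, t + g)      # sender may send again after g
        heapq.heappush(senders, arrival)    # new holder starts forwarding
    return sorted(recv_times)

print(logp_broadcast_times(P=8, L=6, o=2, g=4))
# -> [10, 14, 18, 20, 22, 24, 24]: completion at 24 cycles, as on the slide
```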

Scalable Performance
- The LogP broadcast uses a tree structure to optimize broadcast time.
- The shape of the tree depends on the values of L, o, g, and P.
- The strategy is much more scalable (and ultimately more efficient) than the PRAM broadcast.
[Figure: the broadcast tree with receive times 10, 14, 18, 20, 22, and 24.]

Moral
An analysis can be no better than its underlying model: the more accurate the model, the more accurate the analysis. (This is why we use the TM to establish undecidability but the RAM to analyze complexity.)

Other Models Used for Analysis
- BSP (Bulk Synchronous Parallel): a slight precursor of, and competitor to, LogP
- PMH (Parallel Memory Hierarchy): focuses on memory costs

BSP [Bulk Synchronous Parallel]
Proposed by Valiant, the BSP model consists of:
- P processors, each with local memory
- a communication network for point-to-point message passing between processors
- a mechanism for synchronizing all or some of the processors at defined intervals

BSP Programs
- BSP programs are composed of supersteps.
- In each superstep, processors execute L computational steps using locally stored data, and send and receive messages (see the sketch below).
- Processors are synchronized at the end of the superstep (at which time all messages have been received).
- BSP programs can be implemented through mechanisms like the Oxford BSP library (C routines for implementing BSP programs) and BSP-L.
[Figure: a sequence of supersteps separated by synchronization barriers.]
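
A minimal sketch of the superstep discipline (hypothetical Python, not the Oxford BSP library's C API): compute locally, queue outgoing messages, and make them visible only after the barrier:

```python
# Sketch of the BSP superstep discipline. Each superstep: local
# computation using only locally held data, message exchange, then a
# barrier; delivered messages become visible in the next superstep.

def run_supersteps(compute, P, num_steps):
    """compute(rank, step, inbox) -> list of (dest_rank, message)."""
    inboxes = [[] for _ in range(P)]
    for step in range(num_steps):
        outgoing = []
        for rank in range(P):               # computation + send phase
            outgoing += compute(rank, step, inboxes[rank])
        inboxes = [[] for _ in range(P)]    # barrier: old inboxes retired
        for dest, msg in outgoing:          # messages delivered here are
            inboxes[dest].append(msg)       # visible only next superstep
    return inboxes

# Toy use: every processor sends its rank to processor 0.
final = run_supersteps(lambda r, s, inbox: [(0, r)], P=4, num_steps=1)
print(final[0])   # [0, 1, 2, 3]
```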

BSP Parameters
- P: number of processors (each with memory)
- L: synchronization periodicity
- g: communication cost
- s: processor speed (measured in time steps per second)
A processor sends at most h messages and receives at most h messages in a single superstep (such a communication pattern is called an h-relation).
[Figure: supersteps separated by synchronization barriers.]

BSP Notes
- A complete program is a set of supersteps.
- Communication startup is not modeled; g is measured under continuous-traffic conditions.
- The message size is one data word.
- More than one process or thread can be executed by a processor.
- Computation and communication are generally assumed not to overlap.
- Time for a superstep = (maximum number of local operations performed by any processor) + g × (maximum number of messages sent or received by any processor) + L.
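
In symbols, writing $w$ for the maximum local work and $h$ for the maximum number of messages sent or received by any processor, the slide's cost rule reads:

$$T_{\text{superstep}} = w + g\,h + L$$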

BSP Analysis of the PRAM Broadcast
Algorithm:
- The broadcaster sends the value to shared memory (we'll assume the value starts in P0's memory).
- P processors read from shared memory (i.e., the other processors receive messages from P0).
In the BSP model, a processor is only allowed to send or receive at most h messages in a single superstep, so a broadcast to more than h processors requires a tree structure. If there were more than Lh processors, a tree broadcast would require more than one superstep. How much time does a P-processor broadcast take?

BSP Analysis of the PRAM Broadcast
How much time does it take for a P-processor broadcast?
[Figure: an h-ary broadcast tree.]
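
A sketch of the answer (worked out here, not on the original slide): in each superstep every informed processor can forward the value to at most h others, so the number of informed processors grows by a factor of at most h+1 per superstep. Reaching all P processors therefore takes about $\lceil \log_{h+1} P \rceil$ supersteps, each costing at most $1 + g\,h + L$:

$$T_{\text{bcast}} \approx \left\lceil \log_{h+1} P \right\rceil \,(1 + g\,h + L)$$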

PMH [Parallel Memory Hierarchy] Model
- Proposed by Alpern, Carter, and Ferrante, PMH seeks to represent memory.
- The goal is to model algorithms so that good decisions can be made about where to allocate data during execution.
- The model represents the costs of interprocessor communication and of memory-hierarchy traffic (e.g., between main memory and disk, or between registers and cache).

PMH Model
The computer is modeled as a tree of memory modules with the processors at the leaves; all data movement takes the form of block transfers between children and their parents. In the PMH:
- all modules hold data
- leaf modules also perform computation
- the data in a module is partitioned into blocks
- each module has four parameters (described below)

Un-parameterized PMH Models for a Cluster of Workstations
[Figure: two module trees for a workstation cluster (ALU/registers, caches, main memories, disks, network), one with per-node disks and one with a shared disk system. In the first, bandwidth from a processor to its disk exceeds bandwidth from the processor to the network; in the second, bandwidth between two processors exceeds bandwidth to disk.]

PMH Module Parameters
- Blocksize s_m: how many bytes there are per block of module m
- Blockcount n_m: how many blocks fit in m
- Childcount c_m: how many children m has
- Transfer time t_m: how many cycles it takes to transfer a block between m and its parent
The size of a "node" and the length of an "edge" in the PMH graph should correspond to the blocksize/blockcount and the transfer time, respectively. Generally, all modules at a given level of the tree have the same parameters (a data-structure sketch follows below).
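
As a sketch (hypothetical Python, with made-up parameter values), a PMH machine description is just a tree of modules carrying these four parameters:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of a PMH module tree. Leaves are the processors;
# every module carries the four per-module parameters from the slide.

@dataclass
class Module:
    name: str
    blocksize: int       # s_m: bytes per block of m
    blockcount: int      # n_m: blocks that fit in m
    transfer_time: int   # t_m: cycles to move one block to/from m's parent
    children: List["Module"] = field(default_factory=list)  # c_m = len(children)

    def capacity_bytes(self) -> int:
        return self.blocksize * self.blockcount

# Toy two-level hierarchy (made-up numbers):
cache = Module("cache", blocksize=64, blockcount=512, transfer_time=10)
memory = Module("main memory", blocksize=4096, blockcount=1 << 18,
                transfer_time=100, children=[cache])
print(memory.capacity_bytes())  # 1 GiB in this made-up example
```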

Summary
The goal of parallel computation models is to provide a realistic representation of the costs of programming. A model gives algorithm designers and programmers a measure of algorithm complexity that helps them decide what is "good" (i.e., performance-efficient).
Next up: Mapping and Scheduling