Shared Memory and Message Passing


Shared Memory and Message Passing
Readings: W+A 1.2.1, 1.2.2, 1.2.3, 2.1, 2.1.2, 2.1.3, 8.3, 9.2.1, 9.2.2; Akl 2.5.2
CSE 160/Berman

Models for Communication
- A parallel program is a program composed of tasks (processes) that communicate to accomplish an overall computational goal.
- Two prevalent models for communication: message passing (MP) and shared memory (SM).
- This lecture focuses on MP and SM computing.

Message Passing Communication
- Processes in a message-passing program communicate by passing messages.
- Basic message-passing primitives: Send(parameter list), Receive(parameter list).
- The parameters depend on the software and can be complex.
[Figure: process A sends a message to process B]
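As a concrete sketch (not from the original slides), the two primitives might look like this in MPI, one common message-passing library; the slide's "parameter list" corresponds roughly to MPI's buffer/count/type/rank/tag/communicator arguments, and the ranks and value below are illustrative.

    /* Minimal MPI sketch of Send/Receive between two processes (A = rank 0, B = rank 1). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                 /* process A sends ... */
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {          /* ... process B receives */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("B received %d from A\n", value);
        }
        MPI_Finalize();
        return 0;
    }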

Flavors of Message Passing
- Synchronous: used for routines that return when the message transfer has been completed.
- A synchronous send waits until the complete message can be accepted by the receiving process before sending the message (the send suspends until the receive).
- A synchronous receive waits until the message it is expecting arrives (the receive suspends until the message is sent).
- Also called blocking.
[Figure: A issues a request to send, B returns an acknowledgement, then the message is transferred]

Nonblocking Message Passing
- Nonblocking sends return whether or not the message has been received.
- If the receiving processor is not ready, the message may be stored in a message buffer.
- The message buffer holds messages being sent by A prior to being accepted by the receive in B.
- MPI terminology: routines that use a message buffer and return after their local actions complete are "blocking" (even though the message transfer may not be complete); routines that return immediately are "non-blocking".
[Figure: A sends into a message buffer, from which B later receives]
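A hedged MPI sketch of the flavors discussed above (ranks, tags, and values are illustrative): MPI_Ssend is synchronous, MPI_Send is blocking in the MPI sense (it may return once the data is buffered locally), and MPI_Isend is non-blocking and must later be completed with MPI_Wait.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, v = 42, r;
        MPI_Request req;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Ssend(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);       /* synchronous: completes only
                                                                       after the receive has started */
            MPI_Send(&v, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);        /* "blocking" in MPI: may return
                                                                       once the data is buffered locally */
            MPI_Isend(&v, 1, MPI_INT, 1, 2, MPI_COMM_WORLD, &req); /* non-blocking: returns at once */
            /* ... computation could overlap the transfer here ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE);                     /* complete the non-blocking send */
        } else if (rank == 1) {
            for (int tag = 0; tag < 3; tag++)
                MPI_Recv(&r, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }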

Architectural Support for MP
- The interconnection network should provide connectivity, low latency, and high bandwidth.
- Many interconnection networks have been developed over the last two decades: hypercube; mesh, torus; ring; etc.
[Figure: basic message-passing multicomputer - processors, each with local memory, connected by an interconnection network]

Shared Memory Communication
- Processes in a shared-memory program communicate by accessing shared variables and data structures.
- Basic shared-memory primitives: read from a shared variable, write to a shared variable.
[Figure: basic shared-memory multiprocessor - processors and memories connected by interconnection media]

Accessing Shared Variables
- Conflicts may arise if multiple processes want to write to a shared variable at the same time.
- The programmer, language, and/or architecture must provide a means of resolving conflicts.
[Figure: processes A and B each read shared variable x, compute x+1, and write the result back]
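A minimal shared-memory sketch of this conflict, using POSIX threads as one possible resolution mechanism (the thread names and the mutex are illustrative, not part of the slides): without coordination, A and B can both read the same old value of x and one increment is lost; the mutex serializes the read-modify-write.

    #include <pthread.h>
    #include <stdio.h>

    int x = 0;                                   /* shared variable x */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *increment(void *arg) {
        pthread_mutex_lock(&lock);               /* resolve the write conflict */
        x = x + 1;                               /* read x, compute x+1, write x */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t a, b;                          /* processes A and B from the slide */
        pthread_create(&a, NULL, increment, NULL);
        pthread_create(&b, NULL, increment, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("x = %d\n", x);                   /* always 2 with the mutex */
        return 0;
    }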

Architectural Support for SM
Four basic types of interconnection media:
- Bus
- Crossbar switch
- Multistage network
- Interconnection network with distributed shared memory

Limited Scalability Media I: Bus
- The bus acts as a "party line" between processors and shared memories.
- A bus provides uniform access to shared memory (UMA).
- When the bus saturates, performance of the system degrades.
- For this reason, bus-based systems do not scale to more than 30-40 processors [Sequent Symmetry, Balance].
[Figure: processors and memories attached to a single shared bus]

Limited Scalability Media II: Crossbar
- A crossbar switch connects m processors and n memories with distinct paths between each processor/memory pair.
- A crossbar provides uniform access to shared memory (UMA).
- O(mn) switches are required for m processors and n memories.
- The crossbar is scalable in terms of performance but not in terms of cost; it is used as the basic switching mechanism in the SP2.
[Figure: crossbar connecting processors P1-P5 to memories M1-M5]

Multistage Networks
- Multistage networks provide more scalable performance than a bus but are less costly to scale than a crossbar.
- Typically max{log n, log m} stages connect n processors and m shared memories.
- "Omega" networks (butterfly, shuffle-exchange) are commonly used as multistage networks.
- Multistage networks were used in the CM-5 (fat-tree connecting processor/memory pairs), the BBN Butterfly (butterfly), and the IBM RP3 (omega).
[Figure: processors P1-P5 connected to memories M1-M5 through stages 1 through k]

Omega Networks
- Butterfly multistage: used for the BBN Butterfly, TC2000.
- Shuffle multistage: used for the RP3 and the SP2 high-performance switch.
[Figure: butterfly and shuffle-exchange stage patterns]

Fat-Tree Interconnect
- Bandwidth is increased towards the root.
- Used for the data network of the CM-5 (a MIMD MPP).
- To route from leaf A to leaf B, pick a random switch C in the least common ancestor fat node of A and B, then take the unique tree route from A to C and from C to B.
[Figure: binary fat-tree with 4 leaf nodes in which all internal nodes have two children; internal fat nodes contain 2 or 4 switches]
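A toy sketch of that routing rule, under an assumed leaf numbering and an assumed count of parallel switches per fat node (neither is the CM-5's actual scheme): climb from leaf A to the least common ancestor fat node, pick one of its switches at random to spread load, then descend to leaf B.

    #include <stdio.h>
    #include <stdlib.h>

    /* Levels to climb before leaves a and b share an ancestor in a binary tree. */
    static int lca_level(int a, int b) {
        int level = 0;
        while (a != b) { a >>= 1; b >>= 1; level++; }
        return level;
    }

    int main(void) {
        int a = 2, b = 5;                 /* example leaves, numbered 0..n-1 */
        int up = lca_level(a, b);         /* up-hops from A; same number of down-hops to B */
        int width = 4;                    /* assumed: parallel switches in that fat node */
        int c = rand() % width;           /* random intermediate switch C */
        printf("leaf %d -> up %d levels -> switch %d -> down %d levels -> leaf %d\n",
               a, up, c, up, b);
        return 0;
    }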

Distributed Shared Memory
- Memory is physically distributed but programmed as shared memory.
- Programmers find the shared-memory paradigm desirable.
- Because the shared memory is distributed among the processors, accesses may be sent as messages.
- The difference between access to local memory and to global shared memory creates NUMA (non-uniform memory access) architectures.
- The BBN Butterfly is a NUMA shared-memory multiprocessor.
[Figure: processors with local memories plus shared memory, connected by an interconnection network; BBN Butterfly interconnect]

Alphabet Soup
Terms for shared-memory multiprocessors:
- NUMA = non-uniform memory access (BBN Butterfly, Cray T3E, Origin 2000)
- UMA = uniform memory access (Sequent, Sun HPC 1000)
- COMA = cache-only memory access (KSR)
- (NORMA = no remote memory access: message-passing MPPs)

Using Both SM and MP Together
- It is common for platforms to support one model at a time, SM or MP.
- Clusters of SMPs may be effectively programmed using both SM and MP:
  - SM used within a multiple-processor machine/node.
  - MP used between nodes (see the hybrid sketch below).
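A minimal hybrid skeleton, assuming an MPI library and an OpenMP-capable compiler are available on the cluster (the per-thread "work" below is only a stand-in): OpenMP threads share memory within a node, while MPI carries the message passing between nodes.

    /* Compile with something like: mpicc -fopenmp hybrid.c */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank;
        /* Ask for an MPI library that tolerates threaded (funneled) use. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;
        #pragma omp parallel reduction(+:local)   /* SM parallelism within the node */
        {
            local += omp_get_thread_num() + 1;    /* stand-in for real per-thread work */
        }

        double global = 0.0;                      /* MP communication between nodes */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("global sum = %f\n", global);

        MPI_Finalize();
        return 0;
    }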

SM Program: Prefix Sums
- Problem: given n processes {P_i} and n data items {a_i}, compute the prefix sums A_{1i} = a_1 + ... + a_i such that A_{1i} is in P_i upon termination of the algorithm.
- We will look at an O(log n) SM parallel algorithm that computes the prefix sums of n data items on n processors.

Data Movement for the Prefix Sums Algorithm
Notation: A_{ij} = a_i + a_{i+1} + ... + a_j.

                      P1   P2   P3   P4   P5   P6   P7   P8
  Initial values:     a_1  a_2  a_3  a_4  a_5  a_6  a_7  a_8
  After step (j=0):   A11  A12  A23  A34  A45  A56  A67  A78
  After step (j=1):   A11  A12  A13  A14  A25  A36  A47  A58
  After step (j=2):   A11  A12  A13  A14  A15  A16  A17  A18

The initial values and the final prefix sums reside in the shared memories of P1-P8.

Pseudo-code for the Prefix Sums Algorithm

  Procedure ALLSUMS(a_1, ..., a_n)
    Initialize P_i with data a_i = A_{ii}
    for j = 0 to (log n) - 1 do
      forall i = 2^j + 1 to n do (parallel for)
        Processor P_i:
          (i) obtains the contents of P_{i-2^j} through shared memory, and
          (ii) replaces the contents of P_i with the contents of P_{i-2^j} plus the current contents of P_i
      end forall
    end for

(Combining step: A_{ik} + A_{(k+1)j} = A_{ij}.)
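A shared-memory sketch of ALLSUMS in C with OpenMP (my translation, not the course's code): the array is 0-indexed, off plays the role of 2^j, and the two barrier-separated loops keep each level's reads from seeing that level's writes, which anticipates the synchronization issue discussed on the next slide.

    #include <omp.h>
    #include <stdio.h>

    #define N 8

    int main(void) {
        double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};  /* a_1 .. a_n in shared memory */
        double t[N];

        for (int off = 1; off < N; off *= 2) {   /* off = 2^j for j = 0 .. log2(N)-1 */
            #pragma omp parallel for             /* the "forall" over processors */
            for (int i = off; i < N; i++)
                t[i] = x[i - off];               /* (i) obtain contents of P_{i-2^j} */
            #pragma omp parallel for             /* implicit barrier separates the levels */
            for (int i = off; i < N; i++)
                x[i] += t[i];                    /* (ii) add it into P_i */
        }

        for (int i = 0; i < N; i++)
            printf("%g ", x[i]);                 /* prefix sums A_{1,i+1} */
        printf("\n");
        return 0;
    }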

Programming Issues
- The algorithm assumes that all additions with the same offset (i.e., within one level) are performed at the same time.
- We need some way of tagging or synchronizing the computations.
- It may be cost-effective to do a barrier synchronization between levels (all processors must reach a "barrier" before proceeding to the next level).
- For this algorithm there are no write conflicts within a level: one of the values is already in the shared variable, and the other value need only be summed with the existing value.
- If two values had to be written to the same variable, we would need a well-defined protocol for handling the conflicting writes.

MP Program: Sorting
- Problem: sort a list of numbers/keys (rearrange them so that they are in non-decreasing order).
- Basic sorting operation: compare/exchange (compare/swap).
- In the serial computation (RAM) model, optimal sorting of n keys takes O(n log n) time.
[Figure: two-process compare/exchange - P1 sends its value to P2; P2 compares the two values, retains the max, and sends the min back to P1 (one "active" and one "passive" processor)]
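A hedged MPI sketch of that two-process compare/exchange (ranks and keys are illustrative): rank 0 plays the "passive" P1, which ships its key out and accepts the minimum back, while rank 1 plays the "active" P2, which compares and keeps the maximum.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, mine, other;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        mine = (rank == 0) ? 9 : 4;                       /* example keys */

        if (rank == 0) {                                  /* P1: send key, get min back */
            MPI_Send(&mine, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&mine, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {                           /* P2: compare, keep max, return min */
            MPI_Recv(&other, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            int lo = (other < mine) ? other : mine;
            int hi = (other < mine) ? mine : other;
            MPI_Send(&lo, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            mine = hi;
        }
        printf("rank %d now holds %d\n", rank, mine);
        MPI_Finalize();
        return 0;
    }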

Odd-Even Transposition Sort
- A parallel version of bubblesort: many compare-exchanges are done simultaneously.
- The algorithm consists of odd phases and even phases:
  - In an even phase, even-numbered processes exchange numbers (via messages) with their right neighbors.
  - In an odd phase, odd-numbered processes exchange numbers (via messages) with their right neighbors.
- The algorithm alternates odd and even phases for O(n) iterations.

Odd-Even Transposition Sort: Data Movement
[Figure: general compare-exchange pattern for n=5 processes (P0-P4) over time steps T=1 to T=5]

Odd-Even Transposition Sort: Example (n = 5)

        P0  P1  P2  P3  P4
  T=0:   3  10   4   8   1
  T=1:   3  10   4   8   1   (even phase)
  T=2:   3   4  10   1   8   (odd phase)
  T=3:   3   4   1  10   8   (even phase)
  T=4:   3   1   4   8  10   (odd phase)
  T=5:   1   3   4   8  10   (even phase)

Odd-Even Transposition Code
Compare-exchange is accomplished through message passing.

Even phase:
  P_i, i = 0, 2, 4, ..., n-2:          P_i, i = 1, 3, 5, ..., n-1:
    recv(&A, P_{i+1});                   send(&A, P_{i-1});
    send(&B, P_{i+1});                   recv(&B, P_{i-1});
    if (A < B) B = A;                    if (A < B) A = B;

Odd phase:
  P_i, i = 1, 3, 5, ..., n-3:          P_i, i = 2, 4, 6, ..., n-2:
    recv(&A, P_{i+1});                   send(&A, P_{i-1});
    send(&B, P_{i+1});                   recv(&B, P_{i-1});
    if (A < B) B = A;                    if (A < B) A = B;

[Figure: processes P0-P4 showing which neighboring pairs communicate in each phase]
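For comparison, a runnable MPI sketch of the whole sort (my adaptation, not the course's code): each rank holds one key, the phase loop runs n times, and MPI_Sendrecv pairs the send and receive of each compare-exchange rather than using the ordered send/recv shown above. The initial keys are arbitrary example values.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int n, i, A, B;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &n);
        MPI_Comm_rank(MPI_COMM_WORLD, &i);
        A = (7 * i + 3) % 10;                     /* arbitrary initial key for rank i */

        for (int phase = 0; phase < n; phase++) {
            int partner;
            if (phase % 2 == 0)                   /* even phase: pairs (0,1), (2,3), ... */
                partner = (i % 2 == 0) ? i + 1 : i - 1;
            else                                  /* odd phase: pairs (1,2), (3,4), ... */
                partner = (i % 2 == 1) ? i + 1 : i - 1;
            if (partner < 0 || partner >= n)
                continue;                         /* end processes sit this phase out */

            /* compare-exchange: swap keys; left partner keeps min, right partner keeps max */
            MPI_Sendrecv(&A, 1, MPI_INT, partner, phase,
                         &B, 1, MPI_INT, partner, phase,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (i < partner) A = (A < B) ? A : B;
            else             A = (A > B) ? A : B;
        }
        printf("rank %d holds %d\n", i, A);       /* keys end up in non-decreasing order */
        MPI_Finalize();
        return 0;
    }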

Programming Issues
- The algorithm requires that odd phases and even phases be done in sequence; how do we synchronize?
- Synchronous execution: we need a barrier between phases, but barrier synchronization costs may be high.
- Asynchronous execution: we need to tag messages with the iteration and phase so that the correct values are combined.
- The program may be implemented as SPMD (single program, multiple data) [see HW].

Programming Issues
- The algorithm must be mapped to the underlying platform.
- If communication costs >> computation costs, it may be more cost-effective to map multiple processes to a single processor and bundle their communication.
- The granularity (ratio of the time required for a basic communication operation to the time required for a basic computation) of the underlying platform is needed to determine the best mapping.
[Figure: five processes P0-P4 mapped onto processors A and B in two different ways]

Is Odd-Even Transposition Sort Optimal?
What does optimal mean? An algorithm is optimal if the lower bound for the problem it addresses, with respect to the basic operation being counted, equals the upper bound given by the algorithm's complexity function, i.e., lower bound = upper bound.

Odd-Even Transposition Sort Is Optimal on a Linear Array
- Upper bound = O(n); lower bound = Ω(n).
- Consider sorting algorithms on a linear array where the basic operation being counted is the compare-exchange.
- If the minimum key is in the rightmost array element, it must move all the way to the leftmost array element during the course of any algorithm.
- Compare-exchange operations only allow a key to move one process to the left per time-step, so any sorting algorithm requires at least Ω(n) time-steps to move the minimum key to the first position.
[Figure: array 8 10 5 7 1, with the minimum key in the rightmost position]

Optimality
- Ω(n log n) lower bound for serial sorting algorithms on the RAM model, with respect to comparisons.
- Ω(n) lower bound for parallel sorting algorithms on a linear array, with respect to compare-exchanges.
- There is no conflict between the two, since the platforms/computing environments are different: apples vs. oranges.
- Note that in the parallel world there are different lower bounds for sorting in different environments: Ω(log n) on a PRAM (Parallel RAM), Ω(√n) on a 2D array, etc.

Optimality on a 2D Array
- The same argument as for the linear array gives the lower bound: if keys X and Y at opposite corners must be exchanged, at least Ω(√n) steps are required on a √n x √n array.
- A matching O(√n) upper bound: Thompson and Kung, "Sorting on a Mesh-Connected Parallel Computer", CACM, Vol. 20 (April 1977).
[Figure: √n x √n array with keys X and Y at opposite corners]