Parallel Programming Sathish S. Vadhiyar

2 Motivations of Parallel Computing
Parallel machine: a computer system with more than one processor.
Motivations:
- Faster execution time, exploiting non-dependencies between regions of code
- Presents a level of modularity
- Resource constraints, e.g. large databases
- Certain classes of algorithms lend themselves naturally to parallelism
- Aggregate bandwidth to memory/disk; increase in data throughput
- Clock rate improvement in the past decade – 40%; memory access time improvement in the past decade – 10%

3 Parallel Programming and Challenges
- Recall the advantages and motivations of parallelism
- But parallel programs incur overheads not seen in sequential programs:
  - Communication delay
  - Idling
  - Synchronization

4 Challenges
[Figure: execution timelines of two processes, P0 and P1, showing computation, communication, synchronization, and idle time]

5 How do we evaluate a parallel program?
- Execution time, T(p, n)
- Speedup, S: S(p, n) = T(1, n) / T(p, n)
  - Usually S(p, n) < p; sometimes S(p, n) > p (superlinear speedup)
- Efficiency, E: E(p, n) = S(p, n) / p
  - Usually E(p, n) < 1; sometimes greater than 1
- Scalability – limitations in parallel computing, relation to n and p
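A minimal sketch of how these metrics might be computed from measured timings; the timing values below are made up for illustration.

#include <stdio.h>

/* Speedup and efficiency from measured execution times:
   t1 = sequential time T(1, n), tp = parallel time T(p, n). */
static double speedup(double t1, double tp) { return t1 / tp; }
static double efficiency(double t1, double tp, int p) { return speedup(t1, tp) / p; }

int main(void) {
    double t1   = 100.0;                 /* hypothetical T(1, n) in seconds */
    double tp[] = {52.0, 27.0, 15.0};    /* hypothetical T(p, n) for p = 2, 4, 8 */
    int    ps[] = {2, 4, 8};

    for (int i = 0; i < 3; i++)
        printf("p=%d  S=%.2f  E=%.2f\n",
               ps[i], speedup(t1, tp[i]), efficiency(t1, tp[i], ps[i]));
    return 0;
}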

6 Speedups and efficiency
[Figure: speedup S vs. number of processors p (ideal linear curve vs. practical curve), and efficiency E vs. p]

7 Limitations on speedup – Amdahl's law
- Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
- Overall speedup is expressed in terms of the fractions of computation time with and without the enhancement and the amount of enhancement.
- Places a limit on the speedup due to parallelism.
- Speedup = 1 / (f_s + f_p / P), where f_s is the serial fraction, f_p the parallel fraction, and P the number of processors.

Amdahl's law illustration
S = 1 / (s + (1 - s) / p), where s is the serial fraction and p the number of processors.

Amdahl's law analysis
[Table: speedup as a function of the parallel fraction f for P = 1, 4, 8, 16, ... processors; the values did not survive the transcript]
- For the same fraction f, the speedup numbers fall further and further below the processor count as P grows. Thus Amdahl's law is a bit depressing for parallel programming.
- In practice, the amount of parallel work has to be large enough to match a given number of processors.
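Since the table values were lost, here is a small sketch that regenerates an Amdahl-style table from the formula on slide 7, written here with the parallel fraction f (so the serial fraction is 1 - f); the fractions and processor counts chosen are illustrative.

#include <stdio.h>

/* Amdahl's law: S = 1 / ((1 - f) + f / P), with f the parallel fraction. */
static double amdahl(double f, int p) { return 1.0 / ((1.0 - f) + f / p); }

int main(void) {
    const double fracs[] = {0.5, 0.9, 0.99};   /* illustrative parallel fractions */
    const int    procs[] = {1, 4, 8, 16, 64};

    printf("   f");
    for (int j = 0; j < 5; j++) printf("   P=%-3d", procs[j]);
    printf("\n");

    for (int i = 0; i < 3; i++) {
        printf("%4.2f", fracs[i]);
        for (int j = 0; j < 5; j++) printf("  %6.2f", amdahl(fracs[i], procs[j]));
        printf("\n");
    }
    return 0;
}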

Gustafson's Law
- Amdahl's law – keeps the total parallel work fixed
- Gustafson's law – keeps the computation time on the parallel processors fixed and changes the problem size (the fraction of parallel vs. sequential work) to match that computation time
- For a particular number of processors, find the problem size for which the parallel time equals the constant time
- For that problem size, find the sequential time and the corresponding speedup
- The resulting speedup is a scaled speedup, also called weak speedup
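The slide does not spell out a formula; the standard formulation of Gustafson's scaled speedup, S(P) = s + P(1 - s) with s the serial fraction measured on the parallel run, can be sketched as follows (the value of s is illustrative).

#include <stdio.h>

/* Gustafson's scaled speedup (standard formulation, not spelled out on the slide):
   with serial fraction s measured on the parallel run, S(P) = s + P * (1 - s). */
static double gustafson(double s, int p) { return s + p * (1.0 - s); }

int main(void) {
    const double s = 0.05;                /* hypothetical serial fraction */
    const int procs[] = {4, 8, 16, 64};

    for (int j = 0; j < 4; j++)
        printf("P=%-3d  scaled speedup=%.2f\n", procs[j], gustafson(s, procs[j]));
    return 0;
}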

11 Metrics (contd.)
[Table 5.1: efficiency as a function of problem size n and number of processors p (columns P = 1, 4, 8, 16, ...); the values did not survive the transcript]

12 Scalability
- Efficiency decreases with increasing P and increases with increasing N
- Scalability: how effectively the parallel algorithm can use an increasing number of processors
- How the amount of computation performed must scale with P to keep E constant
- This function of computation in terms of P is called the isoefficiency function
- An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable

13 Parallel Program Models
- Single Program Multiple Data (SPMD)
- Multiple Program Multiple Data (MPMD)
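A minimal SPMD sketch in C with MPI: every process executes the same program and branches on its rank. The file name and output text are illustrative.

/* spmd_hello.c – minimal SPMD sketch: every process runs the same program
   and branches on its rank. Compile with: mpicc spmd_hello.c -o spmd_hello */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("Coordinator: %d processes in total\n", size);
    else
        printf("Worker %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}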

14 Programming Paradigms
- Shared memory model – threads, OpenMP, CUDA
- Message passing model – MPI
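A minimal shared-memory sketch in C with OpenMP, in the same spirit: threads share the array and combine partial sums with a reduction. The array size and contents are illustrative.

/* omp_sum.c – shared-memory sketch with OpenMP: threads share the array and
   combine partial sums via a reduction.
   Compile with: gcc -fopenmp omp_sum.c -o omp_sum */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) a[i] = 1.0;   /* shared data */

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.1f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}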

PARALLELIZATION

16 Parallelizing a Program
Given a sequential program/algorithm, how do we go about producing a parallel version?
Four steps in program parallelization:
1. Decomposition – identifying parallel tasks with a large extent of possible concurrent activity; splitting the problem into tasks
2. Assignment – grouping the tasks into processes with the best load balancing
3. Orchestration – reducing synchronization and communication costs
4. Mapping – mapping of processes to processors (if possible)

17 Steps in Creating a Parallel Program

18 Orchestration
- Goals
  - Structuring communication
  - Synchronization
- Challenges
  - Organizing data structures – packing
  - Small or large messages?
  - How to organize communication and synchronization?

19 Orchestration
- Maximizing data locality
  - Minimizing the volume of data exchange – e.g. not communicating intermediate results, as in a dot product
  - Minimizing the frequency of interactions – packing
- Minimizing contention and hot spots
  - Do not use the same communication pattern with the other processes in all the processes
- Overlapping computations with interactions (see the sketch after this slide)
  - Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2)
  - Initiate communication for type 1; during the communication, perform type 2
- Replicating data or computations
  - Balancing the extra computation or storage cost against the gain from less communication
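A sketch of the overlap idea using non-blocking MPI calls; the halo-exchange pattern and buffer sizes are illustrative and not taken from the slides.

/* overlap.c – sketch of overlapping computation with communication:
   start the exchange of boundary data (type 1), compute on interior data
   (type 2) while the messages are in flight, then wait and use the data. */
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv) {
    int rank, size;
    double halo_send, halo_recv = 0.0, interior[N];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size, left = (rank - 1 + size) % size;
    halo_send = (double)rank;

    /* Type 1: initiate communication of boundary values */
    MPI_Irecv(&halo_recv, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&halo_send, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Type 2: computation that does not depend on the communicated data */
    double sum = 0.0;
    for (int i = 0; i < N; i++) { interior[i] = i * 0.5; sum += interior[i]; }

    /* Wait for the exchange, then use the communicated data */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d: interior sum %.1f, halo from %d = %.1f\n",
           rank, sum, left, halo_recv);

    MPI_Finalize();
    return 0;
}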

20 Mapping
- Which process runs on which particular processor?
- Can depend on the network topology and the communication pattern of the processes
- Can depend on processor speeds in the case of heterogeneous systems

21 Mapping
- Static mapping
  - Mapping based on data partitioning
    - Applicable to dense matrix computations
    - Block distribution
    - Block-cyclic distribution (see the sketch after this slide)
  - Graph-partitioning-based mapping
    - Applicable to sparse matrix computations
  - Mapping based on task partitioning
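A small sketch contrasting block and block-cyclic distributions: which process owns global index i when n elements are divided among p processes. The block size b for the block-cyclic case is illustrative.

/* distribution.c – block vs. block-cyclic ownership of global indices. */
#include <stdio.h>

/* Block distribution: contiguous chunks of roughly n/p elements. */
static int block_owner(int i, int n, int p) {
    int chunk = (n + p - 1) / p;          /* ceiling of n/p */
    return i / chunk;
}

/* Block-cyclic distribution: blocks of size b dealt out round-robin. */
static int block_cyclic_owner(int i, int b, int p) {
    return (i / b) % p;
}

int main(void) {
    int n = 16, p = 4, b = 2;
    printf("index: block owner / block-cyclic owner (n=%d, p=%d, b=%d)\n", n, p, b);
    for (int i = 0; i < n; i++)
        printf("%2d: %d / %d\n", i, block_owner(i, n, p), block_cyclic_owner(i, b, p));
    return 0;
}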

22 Based on Task Partitioning
- Based on the task dependency graph
- In general, the problem is NP-complete

23 Mapping
- Dynamic mapping
  - A process / global memory can hold a set of tasks
  - Distribute some tasks to all processes
  - Once a process completes its tasks, it asks the coordinator process for more tasks
  - Referred to as self-scheduling or work-stealing (see the sketch after this slide)
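A sketch of self-scheduling with MPI: rank 0 acts as the coordinator handing out task indices on demand, and each worker asks for a new task whenever it finishes one. The task count and message tags are illustrative.

/* self_sched.c – dynamic mapping by self-scheduling (coordinator/worker). */
#include <mpi.h>
#include <stdio.h>

#define NUM_TASKS   20
#define TAG_REQUEST  1
#define TAG_TASK     2
#define DONE        -1

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* coordinator */
        int next = 0, active = size - 1, dummy;
        MPI_Status st;
        while (active > 0) {
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, &st);
            int task = (next < NUM_TASKS) ? next++ : DONE;
            if (task == DONE) active--;
            MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK, MPI_COMM_WORLD);
        }
    } else {                               /* worker */
        int task, dummy = 0;
        for (;;) {
            MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, TAG_TASK, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (task == DONE) break;
            printf("worker %d got task %d\n", rank, task);   /* do the "work" */
        }
    }
    MPI_Finalize();
    return 0;
}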

24 High-level Goals

25 PARALLEL ARCHITECTURE

26 Classification of Architectures – Flynn's classification
In terms of parallelism in the instruction and data streams:
- Single Instruction Single Data (SISD): serial computers
- Single Instruction Multiple Data (SIMD)
  - Vector processors and processor arrays
  - Examples: CM-2, Cray C90, Cray Y-MP, Hitachi 3600

27 Classification of Architectures – Flynn's classification
- Multiple Instruction Single Data (MISD): not popular
- Multiple Instruction Multiple Data (MIMD)
  - Most popular
  - IBM SP and most other supercomputers, clusters, computational grids, etc.

28 Classification of Architectures – Based on Memory
- Shared memory
- Two types – UMA and NUMA
- Examples: HP Exemplar, SGI Origin, Sequent NUMA-Q
[Figure: UMA and NUMA shared-memory organizations]

29 Classification 2: Shared Memory vs. Message Passing
- Shared memory machine: the n processors share a physical address space; communication can be done through this shared memory
- The alternative is sometimes referred to as a message passing machine or a distributed memory machine
[Figure: processors connected through an interconnect to a single main memory (shared memory) vs. processors each with their own memory module connected through an interconnect (message passing)]

30 Shared Memory Machines
- The shared memory could itself be distributed among the processor nodes
- Each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time
- Terms: NUMA vs. UMA architecture
  - NUMA: Non-Uniform Memory Access
  - UMA: Uniform Memory Access

31 Classification of Architectures – Based on Memory
- Distributed memory
- Recently, multi-cores
- Yet another classification – MPPs, NOW (Berkeley), COW, computational grids

32 Parallel Architecture: Interconnection Networks
- An interconnection network is defined by switches, links and interfaces
  - Switches – provide mapping between input and output ports, buffering, routing, etc.
  - Interfaces – connect nodes with the network

33 Parallel Architecture: Interconnections
- Indirect interconnects: nodes are connected to an interconnection medium, not directly to each other
  - Shared bus, multiple bus, crossbar, MIN (multistage interconnection network)
- Direct interconnects: nodes are connected directly to each other
  - Topology: linear, ring, star, mesh, torus, hypercube
  - Routing techniques: how the route taken by a message from source to destination is decided
- Network topologies
  - Static – point-to-point communication links among processing nodes
  - Dynamic – communication links are formed dynamically by switches

34 Interconnection Networks
- Static
  - Bus – SGI Challenge
  - Completely connected
  - Star
  - Linear array, ring (1-D torus)
  - Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
  - k-d mesh: d dimensions with k nodes in each dimension
  - Hypercubes – a k-d mesh with k = 2 and d = log p – e.g. many MIMD machines
  - Trees – our campus network
- Dynamic – communication links are formed dynamically by switches
  - Crossbar – Cray X series – non-blocking network
  - Multistage – SP2 – blocking network
- For more details and an evaluation of topologies, refer to the book by Grama et al.

35 Indirect Interconnects
[Figure: shared bus, multiple bus, crossbar switch, and multistage interconnection network built from 2x2 crossbars]

36 Direct Interconnect Topologies
[Figure: linear array, ring, star, mesh, 2-D torus, and hypercube (binary n-cube) for n = 2 and n = 3]

37 Evaluating Interconnection Topologies
- Diameter – maximum distance between any two processing nodes
  - Fully connected – 1
  - Star – 2
  - Ring – p/2
  - Hypercube – log p
- Connectivity – multiplicity of paths between two nodes; the minimum number of arcs that must be removed from the network to break it into two disconnected networks
  - Linear array – 1
  - Ring – 2
  - 2-D mesh – 2
  - 2-D mesh with wraparound – 4
  - d-dimensional hypercube – d

38 Evaluating Interconnection Topologies
- Bisection width – minimum number of links to be removed from the network to partition it into two equal halves
  - Ring – 2
  - P-node 2-D mesh – √P
  - Tree – 1
  - Star – 1
  - Completely connected – P²/4
  - Hypercube – P/2
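A small sketch (not from the slides) that evaluates a few of these formulas for a given P; it assumes P is a power of two (for the hypercube) and a perfect square (for the 2-D mesh).

/* topo_metrics.c – evaluate a few topology metrics for P nodes.
   Compile with: cc topo_metrics.c -o topo_metrics -lm */
#include <stdio.h>
#include <math.h>

int main(void) {
    int P = 64;                                   /* illustrative node count */
    int logP = (int)round(log2((double)P));

    printf("P = %d\n", P);
    printf("diameter:  ring = %d, hypercube = %d\n", P / 2, logP);
    printf("bisection: ring = 2, 2-D mesh = %d, hypercube = %d, "
           "completely connected = %d\n",
           (int)round(sqrt((double)P)), P / 2, P * P / 4);
    return 0;
}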

39 Evaluating Interconnection Topologies
- Channel width – number of bits that can be communicated simultaneously over a link, i.e. the number of physical wires between two nodes
- Channel rate – peak rate of a single physical wire
- Channel bandwidth – channel rate times channel width
- Bisection bandwidth – maximum volume of communication between the two halves of the network, i.e. bisection width times channel bandwidth

40 Shared Memory Architecture: Caches
[Figure: P1 and P2 both read X (value 0) into their caches; one processor then writes X = 1, while the other still holds the old copy, so its next read is a cache hit that returns the stale value 0 – wrong data!]

41 Cache Coherence Problem
- If each processor in a shared-memory multiprocessor machine has a data cache
  - Potential data consistency problem: the cache coherence problem
  - A shared variable modified in a private cache
- Objective: processes shouldn't read stale data
- Solutions
  - Hardware: cache coherence mechanisms

42 Cache Coherence Protocols
- Write update – propagate the cache line to the other processors on every write by a processor
- Write invalidate – each processor gets the updated cache line whenever it reads stale data
- Which is better?

43 Invalidation-Based Cache Coherence
[Figure: P1 and P2 both read X (value 0); when one processor writes X = 1, the other's cached copy is invalidated, so its next read fetches the updated value]

44 Cache Coherence Using Invalidate Protocols
- Three states associated with data items (see the sketch after this slide)
  - Shared – a variable shared by two caches
  - Invalid – another processor (say P0) has updated the data item
  - Dirty – the state of the data item in P0
- Implementations
  - Snoopy
    - For bus-based architectures
    - Shared bus interconnect where all cache controllers monitor all bus activity
    - There is only one operation on the bus at a time; cache controllers can be built to take corrective action and enforce coherence in the caches
    - Memory operations are propagated over the bus and snooped
  - Directory-based
    - Instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors
    - A central directory maintains the states of cache blocks and the associated processors
    - Implemented with presence bits
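A toy sketch of the three states under a write-invalidate policy for a single cache block; this illustrates the idea only and is not any particular machine's protocol.

/* invalidate_sim.c – toy write-invalidate simulation for one block shared
   by NPROC caches: reads fetch a shared copy (downgrading a dirty owner),
   writes make the writer's copy dirty and invalidate all other copies. */
#include <stdio.h>

#define NPROC 3
typedef enum { INVALID, SHARED, DIRTY } state_t;
static const char *name[] = { "INVALID", "SHARED", "DIRTY" };

static state_t st[NPROC];   /* state of the block in each processor's cache */

static void read_block(int p) {
    if (st[p] != INVALID) return;            /* cache hit: no state change */
    for (int i = 0; i < NPROC; i++)
        if (st[i] == DIRTY) st[i] = SHARED;  /* dirty owner supplies data */
    st[p] = SHARED;
}

static void write_block(int p) {
    for (int i = 0; i < NPROC; i++)          /* invalidate all other copies */
        if (i != p) st[i] = INVALID;
    st[p] = DIRTY;                           /* writer holds the dirty copy */
}

static void dump(const char *what) {
    printf("%-12s", what);
    for (int i = 0; i < NPROC; i++) printf("  P%d:%-7s", i, name[st[i]]);
    printf("\n");
}

int main(void) {
    dump("initial");
    read_block(0);  read_block(1);  dump("P0,P1 read");
    write_block(0);                 dump("P0 write");
    read_block(1);                  dump("P1 read");
    return 0;
}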

END