Lecture 29 (Fall 2011): Parallel Programming Overview

Presentation transcript:

Lecture 29: Parallel Programming Overview

Parallel Programming Paradigms: Various Methods
There are many methods of programming parallel computers. Two of the most common are message passing and data parallel.
- Message Passing - the user makes calls to libraries to explicitly share information between processors.
- Data Parallel - data partitioning determines parallelism.
- Shared Memory - multiple processes share a common memory space.
- Remote Memory Operation - a set of processes in which a process can access the memory of another process without its participation.
- Threads - a single process having multiple (concurrent) execution paths.
- Combined Models - composed of two or more of the above.
Note: these models are machine/architecture independent; any of the models can be implemented on any hardware given appropriate operating system support. An effective implementation is one that closely matches its target hardware and provides the user ease of programming.

Parallel Programming Paradigms: Message Passing
The message passing model is defined as:
- a set of processes using only local memory
- processes communicate by sending and receiving messages
- data transfer requires cooperative operations to be performed by each process (a send operation must have a matching receive)
Programming with message passing is done by linking with and making calls to libraries that manage the data exchange between processors. Message passing libraries are available for most modern programming languages.
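The lecture gives no code here; as a minimal stand-in (not the lecture's example), the cooperative send/receive requirement can be illustrated with two POSIX processes that share no memory and exchange a value through a pipe:

    /* Sketch only: a "send" by the child must be matched by a "receive"
       (read) in the parent; the value lives only in the child's memory
       until the message is transferred. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        int fd[2];
        if (pipe(fd) == -1) { perror("pipe"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {                 /* child: the sender */
            close(fd[0]);
            int value = 42;             /* local memory only */
            write(fd[1], &value, sizeof value);     /* send */
            close(fd[1]);
            _exit(0);
        } else {                        /* parent: the receiver */
            close(fd[1]);
            int received = 0;
            if (read(fd[0], &received, sizeof received) == sizeof received)
                printf("parent received %d\n", received);   /* matching receive */
            close(fd[0]);
            wait(NULL);
        }
        return 0;
    }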

Parallel Programming Paradigms: Data Parallel
The data parallel model is defined as:
- each process works on a different part of the same data structure
- commonly a Single Program Multiple Data (SPMD) approach
- data is distributed across processors
- all message passing is done invisibly to the programmer
- commonly built "on top of" one of the common message passing libraries
Programming with the data parallel model is accomplished by writing a program with data parallel constructs and compiling it with a data parallel compiler. The compiler converts the program into standard code plus calls to a message passing library that distributes the data to all the processes.
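A hedged illustration of the flavor of a data parallel construct, using OpenMP in C as a shared-memory stand-in (it is not one of the HPF-style compilers the slide has in mind, and no message passing is generated, but the "same operation over a partitioned index space" idea is the same). Compile with, e.g., gcc -fopenmp:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N];        /* static keeps the arrays off the stack */
        for (int i = 0; i < N; i++)
            b[i] = (double)i;

        /* one data parallel construct: the same operation is applied over
           the whole index space and the runtime splits the iterations
           among workers (an owner-computes split of the data) */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i] + 1.0;

        printf("threads available: %d, a[N-1] = %.1f\n",
               omp_get_max_threads(), a[N - 1]);
        return 0;
    }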

Implementation of Message Passing: MPI
- Message Passing Interface, often called MPI.
- A standard, portable message-passing library definition developed in 1993 by a group of parallel computer vendors, software writers, and application scientists.
- Available to both Fortran and C programs.
- Available on a wide variety of parallel machines.
- Target platform is a distributed memory system.
- All inter-task communication is by message passing.
- All parallelism is explicit: the programmer is responsible for parallelizing the program and implementing the MPI constructs.
- Programming model is SPMD (Single Program Multiple Data).
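A minimal sketch of an SPMD MPI program in C (assumes an MPI implementation such as MPICH or Open MPI is installed): every process runs the same program, asks for its rank, and communicates explicitly through library calls.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);                  /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I?   */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes?   */

        printf("hello from process %d of %d\n", rank, size);

        /* explicit communication: every rank contributes a value,
           rank 0 receives the global sum */
        int local = rank, total = 0;
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of ranks = %d\n", total);

        MPI_Finalize();
        return 0;
    }

Typical usage: compile with mpicc and launch several copies with mpirun, e.g. mpirun -np 4 ./a.out.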

Implementations: F90 / High Performance Fortran (HPF)
- Fortran 90 (F90) - ISO/ANSI standard extensions to Fortran 77.
- High Performance Fortran (HPF) - extensions to F90 to support data parallel programming.
- Compiler directives allow programmer specification of data distribution and alignment.
- New compiler constructs and intrinsics allow the programmer to do computations and manipulations on data with different distributions.

Steps for Creating a Parallel Program
1. If you are starting with an existing serial program, debug the serial code completely.
2. Identify the parts of the program that can be executed concurrently:
   - Requires a thorough understanding of the algorithm.
   - Exploit any inherent parallelism which may exist.
   - May require restructuring of the program and/or algorithm; may require an entirely new algorithm.
3. Decompose the program:
   - Functional parallelism
   - Data parallelism
   - A combination of both
4. Code development:
   - Code may be influenced/determined by machine architecture.
   - Choose a programming paradigm.
   - Determine communication.
   - Add code to accomplish task control and communications.
5. Compile, test, debug.
6. Optimization:
   - Measure performance.
   - Locate problem areas.
   - Improve them.

Amdahl's Law
Speedup due to enhancement E:

   Speedup w/ E = Exec time w/o E / Exec time w/ E

Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S, and the remainder of the task is unaffected:

   ExTime w/ E  = ExTime w/o E * ((1-F) + F/S)
   Speedup w/ E = 1 / ((1-F) + F/S)

Examples: Amdahl's Law
- Amdahl's Law tells us that to achieve linear speedup with 100 processors (i.e., a speedup of 100), none of the original computation can be scalar!
- To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less.
- What speedup could we achieve from 100 processors if 30% of the original program is scalar?

   Speedup w/ E = 1 / ((1-F) + F/S) = 1 / (0.3 + 0.7/100) ≈ 3.3

- The serial program/algorithm might need to be restructured to allow for efficient parallelization.
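The two examples above can be checked with a few lines of C (a sketch; the helper name is ours, not from the lecture):

    #include <stdio.h>

    /* Amdahl's Law: F = fraction that is enhanced, S = speedup of that fraction */
    double speedup(double F, double S)
    {
        return 1.0 / ((1.0 - F) + F / S);
    }

    int main(void)
    {
        printf("99.99%% parallel, 100 procs: %.1f\n", speedup(0.9999, 100.0)); /* ~99  */
        printf("70%% parallel,    100 procs: %.1f\n", speedup(0.70,   100.0)); /* ~3.3 */
        return 0;
    }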

Decomposing the Program
There are three methods for decomposing a problem into smaller tasks to be performed in parallel: functional decomposition, domain decomposition, or a combination of both.
- Functional Decomposition (Functional Parallelism)
  - Decompose the problem into different tasks that can be distributed to multiple processors for simultaneous execution.
  - Good to use when there is no static structure or fixed determination of the number of calculations to be performed.
- Domain Decomposition (Data Parallelism)
  - Partition the problem's data domain and distribute portions to multiple processors for simultaneous execution.
  - Good to use for problems where:
    - the data is static (factoring and solving a large matrix, or finite difference calculations)
    - a dynamic data structure is tied to a single entity and that entity can be subsetted (large multi-body problems)
    - the domain is fixed but the computation within various regions of the domain is dynamic (fluid vortices models)
  - There are many ways to decompose data into partitions to be distributed (see the sketch after this slide):
    - One-dimensional data distribution: block distribution, cyclic distribution
    - Two-dimensional data distribution: block-block distribution, block-cyclic distribution, cyclic-block distribution
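A hedged sketch of the two one-dimensional distributions named above: for P processors and N elements, which processor owns element i? (The helper names are ours, not from the lecture.)

    #include <stdio.h>

    #define N 16
    #define P 4

    int block_owner(int i)  { return i / ((N + P - 1) / P); }  /* contiguous blocks */
    int cyclic_owner(int i) { return i % P; }                  /* round-robin deal  */

    int main(void)
    {
        printf("index :");
        for (int i = 0; i < N; i++) printf(" %2d", i);
        printf("\nblock :");
        for (int i = 0; i < N; i++) printf(" %2d", block_owner(i));
        printf("\ncyclic:");
        for (int i = 0; i < N; i++) printf(" %2d", cyclic_owner(i));
        printf("\n");
        return 0;
    }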

Functional Decomposition of a Program
- Decompose the problem into different tasks that can be distributed to multiple processors for simultaneous execution.
- Good to use when there is no static structure or fixed determination of the number of calculations to be performed.

Functional Decomposition of a Program (diagram)

Domain Decomposition (Data Parallelism)
- Partition the problem's data domain and distribute portions to multiple processors for simultaneous execution.
- There are many ways to decompose data into partitions to be distributed.

Summing 100,000 Numbers on 100 Processors

   sum = 0;
   for (i = 0; i < 1000; i = i + 1)
      sum = sum + A[i];             /* sum the local array subset */

- Start by distributing 1000 elements of vector A to each of the local memories and summing each subset in parallel.
- The processors then coordinate in adding together the partial sums (Pn is the number of the processor, send(x,y) sends value y to processor x, and receive() receives a value):

   half = 100; limit = 100;
   repeat
      half = (half + 1) / 2;                        /* dividing line */
      if (Pn >= half && Pn < limit) send(Pn - half, sum);
      if (Pn < (limit / 2)) sum = sum + receive();
      limit = half;
   until (half == 1);                               /* final sum in P0's sum */
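For reference, a hedged MPI rendering of the same halving-tree reduction (the slide's send/receive pseudocode mapped onto MPI_Send/MPI_Recv; the local "sums" here are just the ranks so the program is self-contained, and it works for any number of processes):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int Pn, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &Pn);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int sum = Pn;                     /* stand-in for the local subset sum */

        int half = nprocs, limit = nprocs;
        do {
            half = (half + 1) / 2;                      /* dividing line */
            if (Pn >= half && Pn < limit)               /* upper half sends down */
                MPI_Send(&sum, 1, MPI_INT, Pn - half, 0, MPI_COMM_WORLD);
            if (Pn < limit / 2) {                       /* lower half receives   */
                int partial;
                MPI_Recv(&partial, 1, MPI_INT, Pn + half, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                sum += partial;
            }
            limit = half;
        } while (half > 1);

        if (Pn == 0) printf("total = %d\n", sum);       /* final sum in P0 */
        MPI_Finalize();
        return 0;
    }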

An Example with 10 Processors
(Diagram: processors P0-P9 each hold a partial sum; initially half = 10.)

An Example with 10 Processors (continued)
(Diagram: the halving-tree reduction; at each step the upper half of the active processors send their partial sums to the lower half, with half stepping through 10, 5, 3, 2, 1 and limit through 10, 5, 3, 2, until the final sum resides in P0.)

Domain Decomposition (Data Parallelism)
- Partition the problem's data domain and distribute portions to multiple processors for simultaneous execution.
- There are many ways to decompose data into partitions to be distributed.

Cannon's Matrix Multiplication (diagram)

Review: Multiprocessor Basics
- Q1 - How do they share data?
- Q2 - How do they coordinate?
- Q3 - How scalable is the architecture? How many processors?

                                              # of processors
   Communication model   Message passing      8 to 2048
                         Shared address NUMA  8 to 256
                         Shared address UMA   2 to 64
   Physical connection   Network              8 to 256
                         Bus                  2 to 36

Review: Bus Connected SMPs (UMAs)
- Caches are used to reduce latency and to lower bus traffic.
- Must provide hardware for cache coherence and process synchronization.
- Bus traffic and bandwidth limit scalability (< ~36 processors).
(Diagram: multiple processor/cache pairs, memory, and I/O all attached to a single bus.)

Network Connected Multiprocessors
- Either a single address space (NUMA and ccNUMA) with implicit processor communication via loads and stores, or multiple private memories with message passing communication via sends and receives.
- The interconnection network supports interprocessor communication.
(Diagram: processor/cache/memory nodes attached to an interconnection network (IN).)

Communication in Network Connected Multi's
- Implicit communication via loads and stores:
  - hardware designers have to provide coherent caches and process synchronization primitives
  - lower communication overhead
  - harder to overlap computation with communication
  - more efficient to fetch remote data by address when it is demanded than to send it in case it might be used (such a machine has distributed shared memory (DSM))
- Explicit communication via sends and receives:
  - simplest solution for hardware designers
  - higher communication overhead
  - easier to overlap computation with communication
  - easier for the programmer to optimize communication
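To make "implicit communication via loads and stores" concrete, here is a hedged sketch in C using POSIX threads within one process as a stand-in for processors sharing an address space; the producer communicates a value with an ordinary store, the consumer with an ordinary load, and a C11 atomic flag provides the synchronization the slide mentions. Compile with -pthread.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int shared_data;              /* communicated by ordinary store/load */
    static atomic_int ready = 0;         /* synchronization flag */

    static void *producer(void *arg)
    {
        (void)arg;
        shared_data = 42;                                      /* plain store */
        atomic_store_explicit(&ready, 1, memory_order_release);
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                                                  /* spin until flagged */
        printf("consumer loaded %d\n", shared_data);           /* plain load  */
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }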

Cache Coherency in NUMAs
- For performance reasons we want to allow shared data to be stored in caches.
- Once again we have multiple copies of the same data, with the same address, in different processors.
  - Bus snooping won't work, since there is no single bus on which all memory references are broadcast.
- Directory-based protocols:
  - keep a directory that is a repository for the state of every block in main memory (which caches have copies, whether it is dirty, etc.)
  - directory entries can be distributed (the sharing status of a block is always in a single known location) to reduce contention
  - the directory controller sends explicit commands over the IN to each processor that has a copy of the data
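A hedged sketch of what a directory entry might look like in C: per memory block, a coherence state plus a bit vector of which processors hold a copy (the field names and the three-state encoding are ours, chosen for illustration, not taken from a specific protocol in the lecture).

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    enum block_state { UNCACHED, SHARED, EXCLUSIVE };  /* typical directory states */

    struct dir_entry {
        enum block_state state;
        uint64_t sharers;        /* bit i set => processor i has a copy (up to 64) */
    };

    static bool has_copy(const struct dir_entry *e, int proc)
    {
        return (e->sharers >> proc) & 1u;
    }

    static void add_sharer(struct dir_entry *e, int proc)
    {
        e->sharers |= (uint64_t)1 << proc;
        e->state = SHARED;
    }

    int main(void)
    {
        struct dir_entry e = { UNCACHED, 0 };
        add_sharer(&e, 3);
        add_sharer(&e, 17);
        printf("P3 has copy: %d, P5 has copy: %d\n", has_copy(&e, 3), has_copy(&e, 5));
        return 0;
    }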

IN Performance Metrics
- Network cost:
  - number of switches
  - number of (bidirectional) links on a switch to connect to the network (plus one link to connect to the processor)
  - width in bits per link, length of link
- Network bandwidth (NB) - represents the best case:
  - bandwidth of each link * number of links
- Bisection bandwidth (BB) - represents the worst case:
  - divide the machine into two parts, each with half the nodes, and sum the bandwidth of the links that cross the dividing line
- Other IN performance issues:
  - latency on an unloaded network to send and receive messages
  - throughput - maximum number of messages transmitted per unit time
  - number of routing hops in the worst case; congestion control and delay

Bus IN
- N processors, 1 switch, 1 link (the bus)
- Only 1 simultaneous transfer at a time:
  - NB = link (bus) bandwidth * 1
  - BB = link (bus) bandwidth * 1
(Diagram legend used on the following slides: processor node; bidirectional network switch.)

Ring IN
- N processors, N switches, 2 links/switch, N links
- N simultaneous transfers:
  - NB = link bandwidth * N
  - BB = link bandwidth * 2
- If a link is as fast as a bus, the ring is only twice as fast as a bus in the worst case, but is N times faster in the best case.

Fully Connected IN
- N processors, N switches, N-1 links/switch, (N*(N-1))/2 links
- N simultaneous transfers:
  - NB = link bandwidth * (N*(N-1))/2
  - BB = link bandwidth * (N/2)^2

Crossbar (Xbar) Connected IN
- N processors, N^2 switches (unidirectional), 2 links/switch, N^2 links
- N simultaneous transfers:
  - NB = link bandwidth * N
  - BB = link bandwidth * N/2

Hypercube (Binary N-cube) Connected IN
- N processors, N switches, log N links/switch, (N log N)/2 links
- N simultaneous transfers:
  - NB = link bandwidth * (N log N)/2
  - BB = link bandwidth * N/2
(Diagrams: a 2-cube and a 3-cube.)

2D and 3D Mesh/Torus Connected IN
- N processors, N switches; 2, 3, or 4 (2D torus) or 6 (3D torus) links/switch; 4N/2 links or 6N/2 links
- N simultaneous transfers:
  - NB = link bandwidth * 4N or link bandwidth * 6N
  - BB = link bandwidth * 2*N^(1/2) or link bandwidth * 2*N^(2/3)

Fat Tree
- Trees are good structures. People in CS use them all the time. Suppose we wanted to make a tree network.
- Any time A wants to send to C, it ties up the upper links, so that B can't send to D.
  - The bisection bandwidth of a tree is horrible: 1 link, at all times.
- The solution is to 'thicken' the upper links.
  - Adding more links as the tree gets higher increases the bisection bandwidth.
- Rather than design a bunch of N-port switches, use pairs.
(Diagram: a binary tree connecting leaf nodes A, B, C, D.)

Fat Tree
- N processors, log(N-1)*logN switches, 2 up + 4 down = 6 links/switch, N*logN links
- N simultaneous transfers:
  - NB = link bandwidth * N log N
  - BB = link bandwidth * 4

SGI NUMAlink Fat Tree (diagram)

IN Comparison
For a 64-processor system (bandwidths in units of a single link's bandwidth; link and per-switch totals include the one link per switch to its processor, per the metrics slide):

                            Bus    Ring   2D Torus   6-cube   Fully connected
   Network bandwidth          1      64        256      192              2016
   Bisection bandwidth        1       2         16       32              1024
   Total # of switches        1      64         64       64                64
   Links per switch           -       3          5        7                64
   Total # of links (bidi)    1     128        192      256              2080
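A small sketch that recomputes this comparison from the per-topology formulas on the preceding slides (switch and link totals add the one processor link per switch; compile with -lm):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        int N = 64;
        int logN = (int)round(log2((double)N));   /* 6 for a 6-cube */

        /*                 name            NB             BB                sw   links/sw     total links        */
        printf("%-16s %6d %6d %6d %6s %6d\n", "Bus",      1, 1, 1, "-", 1);
        printf("%-16s %6d %6d %6d %6d %6d\n", "Ring",     N, 2, N, 2 + 1, N + N);
        printf("%-16s %6d %6d %6d %6d %6d\n", "2D torus", 4 * N, 2 * (int)sqrt((double)N), N, 4 + 1, 4 * N / 2 + N);
        printf("%-16s %6d %6d %6d %6d %6d\n", "6-cube",   N * logN / 2, N / 2, N, logN + 1, N * logN / 2 + N);
        printf("%-16s %6d %6d %6d %6d %6d\n", "Fully",    N * (N - 1) / 2, (N / 2) * (N / 2), N, (N - 1) + 1, N * (N - 1) / 2 + N);
        return 0;
    }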

Network Connected Multiprocessors

   System           Proc             Proc speed   # Proc     IN topology                   BW/link (MB/sec)
   SGI Origin       R                -            -          fat tree                      800
   Cray 3TE         Alpha            - MHz        2,048      3D torus                      600
   Intel ASCI Red   Intel            333 MHz      9,632      mesh                          800
   IBM ASCI White   Power3           375 MHz      8,192      multistage Omega              500
   NEC ES           SX-5             500 MHz      640*8      640-xbar                      16000
   NASA Columbia    Intel Itanium2   1.5 GHz      512*20     fat tree, Infiniband          -
   IBM BG/L         Power PC         - GHz        65,536*2   3D torus, fat tree, barrier   -

IBM BlueGene

                    512-node prototype        BlueGene/L
   Peak Perf        1.0 / 2.0 TFlops/s        180 / 360 TFlops/s
   Memory Size      128 GByte                 16 / 32 TByte
   Foot Print       9 sq feet                 2500 sq feet
   Total Power      9 KW                      1.5 MW
   # Processors     512 dual proc             65,536 dual proc
   Networks         3D Torus, Tree, Barrier   3D Torus, Tree, Barrier
   Torus BW         3 B/cycle                 3 B/cycle

A BlueGene/L Chip
(Block diagram: two 700 MHz PowerPC 440 CPUs, each with a double FPU and 32K/32K L1 caches, 2KB L2s, a 16KB multiport SRAM buffer, and a shared 4MB L3 of ECC eDRAM (128B lines, 8-way associative); on-chip interfaces to Gbit Ethernet, the 3D torus (6 in, 6 out; 1.4 Gb/s links), the fat tree (3 in, 3 out; 2.8 Gb/s links), 4 global barriers, and a 144-bit DDR controller to 256MB at 5.5 GB/s.)

Networks of Workstations (NOWs): Clusters
- Clusters of off-the-shelf, whole computers with multiple private address spaces.
- Clusters are connected using the I/O bus of the computers:
  - lower bandwidth than multiprocessors that use the memory bus
  - lower speed network links
  - more conflicts with I/O traffic
- Clusters of N processors have N copies of the OS, limiting the memory available for applications.
- Improved system availability and expandability:
  - easier to replace a machine without bringing down the whole system
  - allows rapid, incremental expandability
- Economy-of-scale advantages with respect to costs.

Commercial (NOW) Clusters

   System           Proc             Proc speed   # Proc     Network
   Dell PowerEdge   P4 Xeon          3.06 GHz     2,500      Myrinet
   eServer IBM SP   Power4           1.7 GHz      2,944      -
   VPI BigMac       Apple G5         2.3 GHz      2,200      Mellanox Infiniband
   HP ASCI Q        Alpha            - GHz        8,192      Quadrics
   LLNL Thunder     Intel Itanium2   1.4 GHz      1,024*4    Quadrics
   Barcelona        PowerPC          - GHz        4,536      Myrinet

Summary
- Flynn's classification of processors - SISD, SIMD, MIMD
  - Q1 - How do processors share data?
  - Q2 - How do processors coordinate their activity?
  - Q3 - How scalable is the architecture (what is the maximum number of processors)?
- Shared address multis - UMAs and NUMAs
  - Scalability of bus connected UMAs is limited (< ~36 processors)
  - Network connected NUMAs are more scalable
  - Interconnection Networks (INs): fully connected, xbar, ring, mesh, n-cube, fat tree
- Message passing multis
- Cluster connected (NOWs) multis