Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd Edition, by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.

1.1 Parallel Computers (Chapter 1)
1.2 Demand for Computational Speed
There is a continual demand for greater computational speed from a computer system than is currently possible.
Computations must be completed within a "reasonable" time period.
For many such problems, the only way to achieve specific computational goals is to use multiple processors simultaneously, for both speed and memory.
–Areas requiring great computational speed include numerical modeling and simulation of scientific and engineering problems.
1.3 Parallel computing model fits real world
The universe is inherently parallel, so parallel models fit it best.
Physical processes occur in parallel:
–weather, galaxy formation, epidemics, traffic jams, ...
Social/work processes occur in parallel:
–ant colonies, wolf packs, assembly lines, tutorials, ...
1.4 Applications
Areas requiring great computational speed include numerical modeling and simulation of scientific and engineering problems.
1.5 Grand Challenge Problems
A grand challenge problem is one that cannot be solved in a reasonable amount of time with today's computers. Obviously, an execution time of 10 years is always unreasonable.
Examples:
–Modeling car crashes
–Modeling large DNA structures
–Global weather forecasting
–Modeling motion of astronomical bodies
1.6 Crash Simulation
A greatly simplified model, based on parallelizing crash simulation for Ford Motor Company.
Such simulations save a significant amount of money and time compared to testing real cars.
This example illustrates various phenomena which are common to many simulations and other large-scale applications.
1.7 Finite Element Representation
The car is modeled by a triangulated surface (the elements).
The simulation consists of modeling the movement of the elements during each time step, incorporating the forces on them to determine their new position.
In each time step, the movement of each element depends on its interaction with the other elements that it is physically adjacent to.
1.8 Car model in multiple elements
1.9 Serial algorithm
1.10 Distribution of car elements
1.11 Parallel algorithm
1.12 Ghost cells
Weather Forecasting
The atmosphere is modeled by dividing it into 3-dimensional cells.
The calculations for each cell are repeated many times to model the passage of time.
Global Weather Forecasting Example
Suppose the whole global atmosphere is divided into cells of size 1 mile × 1 mile × 1 mile to a height of 10 miles (10 cells high): about $5 \times 10^8$ cells.
Suppose each calculation requires 200 floating point operations. In one time step, $10^{11}$ floating point operations are necessary.
To forecast the weather over 7 days using 1-minute intervals (about $10^4$ time steps, so about $10^{15}$ operations in total), a computer operating at 1 Gflops ($10^9$ floating point operations/s) takes $10^6$ seconds, or over 10 days.
To perform the calculation in 5 minutes requires a computer operating at 3.4 Tflops ($3.4 \times 10^{12}$ floating point operations/s).
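
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and the inputs are the round numbers quoted on the slide.

```python
# Inputs taken from the slide's example (round figures, not measurements).
cells = 5e8                  # about 5 x 10^8 atmospheric cells
flops_per_cell = 200         # floating point operations per cell per time step
steps = 7 * 24 * 60          # 7 days at 1-minute intervals

flops_per_step = cells * flops_per_cell         # operations in one time step
total_flops = flops_per_step * steps            # operations for the whole forecast

seconds_at_1_gflops = total_flops / 1e9         # run time at 10^9 flop/s
days_at_1_gflops = seconds_at_1_gflops / 86400  # same run time in days

required_rate = total_flops / (5 * 60)          # flop/s needed to finish in 5 minutes

print(f"operations per time step: {flops_per_step:.2e}")
print(f"total operations:         {total_flops:.2e}")
print(f"time at 1 Gflops:         {days_at_1_gflops:.1f} days")
print(f"rate for a 5-minute run:  {required_rate / 1e12:.2f} Tflops")
```

The exact step count (10080) gives a slightly larger total than the rounded $10^{15}$, which is why the required rate comes out near 3.4 Tflops rather than 3.3.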
Parallel Computing
Using more than one computer, or a computer with more than one processor, to solve a problem.
Motives
Usually faster computation. The very simple idea is that n computers operating simultaneously can achieve the result n times faster. In practice it will not be n times faster, for various reasons.
Other motives include: fault tolerance, larger amount of memory available, ...
Top 5 supercomputers as of November 2009 (performance in petaflops)
1.17 Evaluating the performance
An important component of effective parallel computing is determining whether the program is performing well.
If it is not running efficiently, or cannot be scaled to the target number of processors, then one needs to determine the causes of the problem and develop better approaches.
1.18 Some criteria
Speedup Factor
$$S(p) = \frac{\text{Execution time using one processor (best sequential algorithm)}}{\text{Execution time using a multiprocessor with } p \text{ processors}} = \frac{t_s}{t_p}$$
where $t_s$ is the execution time on a single processor and $t_p$ is the execution time on a multiprocessor.
$S(p)$ gives the increase in speed obtained by using the multiprocessor.
Use the best sequential algorithm with the single processor system. The underlying algorithm for the parallel implementation might be (and usually is) different.
Speedup factor can also be cast in terms of computational steps:
$$S(p) = \frac{\text{Number of computational steps using one processor}}{\text{Number of parallel computational steps with } p \text{ processors}}$$
Can also extend time complexity to parallel computations.
1.21 Observed Speedup
Maximum Speedup: Amdahl's law
[Figure: (a) One processor: a serial section taking $f t_s$ followed by parallelizable sections taking $(1 - f) t_s$, for a total of $t_s$. (b) Multiple processors ($p$ processors): the serial section $f t_s$ followed by the parallelizable sections taking $(1 - f) t_s / p$, for a total of $t_p$.]
Speedup factor is given by:
$$S(p) = \frac{t_s}{f t_s + (1 - f) t_s / p} = \frac{p}{1 + (p - 1) f}$$
where $f$ is the fraction of the computation that must be executed serially.
This equation is known as Amdahl's law.
[Figure: speedup factor $S(p)$ against number of processors $p$, with curves for f = 0%, f = 5%, f = 10%, and f = 20%.]
Maximum Speedup
Even with an infinite number of processors, maximum speedup is limited to $1/f$.
Example: with only 5% of the computation being serial, maximum speedup is 20, irrespective of the number of processors.
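
The $1/f$ limit can be seen numerically. The sketch below assumes the usual form of Amdahl's law, $S(p) = p / (1 + (p - 1) f)$, with $f$ the serial fraction; the function name `amdahl_speedup` is made up for this illustration.

```python
def amdahl_speedup(f, p):
    """Speedup predicted by Amdahl's law for serial fraction f on p processors."""
    return p / (1 + (p - 1) * f)


# Tabulate the speedup for the serial fractions shown on the slides.
for f in (0.0, 0.05, 0.10, 0.20):
    row = ", ".join(f"p={p}: {amdahl_speedup(f, p):6.2f}" for p in (4, 16, 256))
    print(f"f = {f:4.0%}  {row}")

# With f = 5%, even a very large machine stays just below 1/f = 20.
print(amdahl_speedup(0.05, 10**9))
```

With f = 0 the speedup equals p (linear speedup); with any nonzero f the curve flattens as p grows.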
Maximum Speedup
If all components are parallelizable, maximum speedup is usually p with p processors (linear speedup).
It is possible to get superlinear speedup (greater than p), but usually for a specific reason, as in the searching example that follows.
Superlinear Speedup example: Searching
(a) Searching each sub-space sequentially
[Figure: the search space is divided into $p$ sub-spaces, each taking $t_s/p$ to search. Starting at time 0, the sequential search works through $x$ sub-spaces ($x$ indeterminate), taking $x \times t_s/p$, and then finds the solution after a further time $\Delta t$ in the next sub-space.]
(b) Searching each sub-space in parallel
[Figure: with all sub-spaces searched at once, the solution is found after time $\Delta t$.]
Speed-up is then given by
$$S(p) = \frac{x \times \dfrac{t_s}{p} + \Delta t}{\Delta t}$$
Worst case for sequential search is when the solution is found in the last sub-space search. Then the parallel version offers the greatest benefit, i.e.
$$S(p) = \frac{\dfrac{p - 1}{p} \times t_s + \Delta t}{\Delta t} \to \infty \quad \text{as } \Delta t \text{ tends to zero}$$
Least advantage for the parallel version is when the solution is found in the first sub-space search of the sequential search, i.e.
$$S(p) = \frac{\Delta t}{\Delta t} = 1$$
Actual speed-up depends upon which sub-space holds the solution, but could be extremely large.
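
The three cases above follow from one expression, which can be sketched in Python. The function name and the sample values ($p = 4$, $t_s = 100$, a final search time of 0.5) are assumptions chosen only for illustration.

```python
def search_speedup(x, p, t_s, dt):
    """Speedup of a p-way parallel search over a sequential search.

    x   : number of sub-spaces the sequential search exhausts before
          reaching the one that holds the solution (0 <= x <= p - 1)
    t_s : time to search the whole space sequentially
    dt  : time to find the solution inside the sub-space that holds it
    """
    sequential_time = x * t_s / p + dt
    parallel_time = dt
    return sequential_time / parallel_time


p, t_s, dt = 4, 100.0, 0.5
for x in range(p):
    print(f"solution in sub-space {x}: S(p) = {search_speedup(x, p, t_s, dt):.1f}")
```

For x = 0 the speedup is 1; for x = p - 1 it is far larger than p, and it grows without bound as the final search time shrinks.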
Types of Parallel Computers
Two principal types:
–Shared memory multiprocessor
–Distributed memory multicomputer
Shared Memory Multiprocessor
Conventional Computer
Consists of a processor executing a program stored in a (main) memory.
Each main memory location is located by its address. Addresses start at 0 and extend to $2^b - 1$ when there are $b$ bits (binary digits) in the address.
[Figure: main memory connected to the processor, with instructions flowing to the processor and data flowing to or from the processor.]
Shared Memory Multiprocessor System
A natural way to extend the single processor model: multiple processors are connected to multiple memory modules, such that each processor can access any memory module.
[Figure: processors connected through an interconnection network to memory modules forming one address space.]
Global memory space, accessible by all processors.
Processors may have local memory to hold copies of some global memory.
Consistency of copies is usually maintained by hardware.
Simplistic view of a small shared memory multiprocessor
[Figure: processors connected to shared memory by a bus.]
Examples:
–Dual Pentiums
–Quad Pentiums
Quad Pentium Shared Memory Multiprocessor
[Figure: four processors, each with an L1 cache, an L2 cache, and a bus interface, attached to a processor/memory bus. The bus also connects a memory controller, which leads to the shared memory, and an I/O interface, which leads to the I/O bus.]
1.38 Pros and Cons of shared memory system
Disadvantages:
Programming Shared Memory Multiprocessors
Approaches
1. Threads: the programmer decomposes the program into individual parallel sequences (threads), each being able to access variables declared outside the threads. Examples: Pthreads and Java threads.
2. Sequential programming language with preprocessor compiler directives to declare shared variables and specify parallelism. Example: OpenMP, an industry standard added to C/C++ and Fortran. Needs an OpenMP compiler.
More approaches
3. Sequential programming language with added syntax to declare shared variables and specify parallelism. Example: UPC (Unified Parallel C). Needs a UPC compiler.
4. Parallel programming language with syntax to express parallelism (constructs and statements). The compiler creates executable code for each processor (not now common).
5. Sequential programming language, with a parallelizing compiler asked to convert it into parallel executable code (also not now common).
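
Approach 1 can be sketched with Python's standard `threading` module, used here as a stand-in for Pthreads or Java threads. The data, the number of threads, and the names `partial_sum` and `total` are assumptions made for this illustration. In standard CPython the global interpreter lock prevents threads from executing Python bytecode simultaneously, so this shows the programming model rather than a speedup.

```python
import threading

data = list(range(1, 101))
total = 0                      # declared outside the threads, shared by all of them
lock = threading.Lock()        # protects updates to the shared variable


def partial_sum(chunk):
    """Sum one chunk privately, then add the result to the shared total."""
    global total
    s = sum(chunk)             # private work, no sharing involved
    with lock:                 # only one thread at a time updates total
        total += s


n_threads = 4
chunks = [data[i::n_threads] for i in range(n_threads)]
threads = [threading.Thread(target=partial_sum, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()                   # wait for every thread before reading the result

print(total)
```

The lock is the important design choice: without it, two threads could read and write `total` at the same time and lose an update.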
Distributed Memory System
Complete computers connected through an interconnection network.
[Figure: computers, each with a processor and local memory, exchanging messages through an interconnection network.]
1.42 Pros and Cons of Distributed Memory System
Disadvantages:
Distributed Shared Memory
Message passing is not as attractive to programmers as the shared memory paradigm, since data cannot be shared and must be copied or transferred.
Distributed shared memory makes the main memory of a group of interconnected computers look like a single memory with a single address space, so shared memory programming techniques can be used.
Message passing still occurs, but in an automated way that hides the fact that memory is distributed.
[Figure: computers, each with a processor and its part of the shared memory, exchanging messages through an interconnection network.]
1.44 Message Passing
1.45 Communication Speed
On most distributed memory systems, passing messages is relatively slow, with startup (latency) times of thousands of cycles (and far more on many clusters).
Typically, once the message has started, the additional time per byte (set by the bandwidth) is relatively small.
1.46 Reducing Latency
Reducing the effect of high latency is often important for performance. Some useful approaches:
–Reduce the number of messages by mapping communicating entities onto the same processor.
–Combine messages having the same sender and destination.
–If processor P has data needed by processor Q, have P send to Q, rather than Q first requesting it. P should send as soon as the data is ready, and Q should read as late as possible, to increase the probability that the data has arrived.
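
The benefit of combining messages can be estimated with a simple cost model. The sketch below assumes the time for one message is a fixed startup latency plus the message size divided by the bandwidth. That linear model and the numbers used (50 microseconds latency, 1 GB/s bandwidth, 100 messages of 80 bytes) are illustrative assumptions, not measurements of any particular system.

```python
def message_time(n_bytes, latency, bandwidth):
    """Estimated time in seconds to send one message of n_bytes.

    latency   : fixed startup time per message, in seconds
    bandwidth : sustained transfer rate, in bytes per second
    """
    return latency + n_bytes / bandwidth


LATENCY = 50e-6      # assumed startup time: 50 microseconds
BANDWIDTH = 1e9      # assumed bandwidth: 1 GB/s

n_messages, size = 100, 80

# 100 separate small messages pay the startup cost 100 times.
separate = n_messages * message_time(size, LATENCY, BANDWIDTH)

# One combined message pays it once.
combined = message_time(n_messages * size, LATENCY, BANDWIDTH)

print(f"separate: {separate * 1e6:.1f} microseconds")
print(f"combined: {combined * 1e6:.1f} microseconds")
```

Under these assumptions nearly all of the cost of the separate messages is latency, which is why combining them pays off.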
1.47 Blocking
If messages are blocking, i.e., if a processor can't proceed until the message is finished, then deadlock can occur, where no processor can proceed.
Example: Processor A sends a message to B while B sends to A. If the sends are blocking, neither finishes until the other finishes receiving, but neither starts receiving until its own send has finished.
This can be avoided by having A send then receive, while B receives then sends. However, this is often difficult to coordinate when there are many processors.
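
The A/B example can be imitated on one machine with two Python threads standing in for the two processors. This is a toy model, not MPI: the `Channel` class, the `exchange` function, and the use of a timeout to give up on a blocked operation are all assumptions made for this sketch. A real blocked send would simply hang.

```python
import queue
import threading


class Channel:
    """One-way channel whose send blocks until the message is received."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)
        self._taken = threading.Event()

    def send(self, msg, timeout):
        self._slot.put(msg)
        # Blocking send: wait until the receiver has taken the message.
        # Returns False if that does not happen within the timeout.
        return self._taken.wait(timeout)

    def recv(self, timeout):
        try:
            msg = self._slot.get(timeout=timeout)
        except queue.Empty:
            return None
        self._taken.set()
        return msg


def exchange(a_ops, b_ops, timeout=0.5):
    """Run A and B with the given operation orders; True if both complete."""
    a_to_b, b_to_a = Channel(), Channel()
    finished = {"A": False, "B": False}

    def run(name, ops, outgoing, incoming):
        for op in ops:
            if op == "send":
                ok = outgoing.send("data from " + name, timeout)
            else:
                ok = incoming.recv(timeout) is not None
            if not ok:
                return              # gave up waiting: treat as deadlock
        finished[name] = True

    workers = [
        threading.Thread(target=run, args=("A", a_ops, a_to_b, b_to_a)),
        threading.Thread(target=run, args=("B", b_ops, b_to_a, a_to_b)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return finished["A"] and finished["B"]


both_send_first = exchange(("send", "recv"), ("send", "recv"))
b_receives_first = exchange(("send", "recv"), ("recv", "send"))
print(both_send_first, b_receives_first)
```

When both sides send first, each waits for a receive that never starts, so neither completes. When B receives first, A's send is matched and the exchange should complete.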
1.48 Non-blocking
Often the easiest way to prevent deadlock is non-blocking communication, where a processor can send and proceed before the receive is finished.
However, this requires receiver buffer space, which may fill (and hence cause blocking), and extra copying of messages, reducing performance.
Message Passing Interface (MPI)
An important communication standard.
We will show some snippets of MPI to illustrate some of the issues, but MPI is a major topic that we cannot address in detail. Many programs need only a few MPI features.
There are many implementations of MPI:
–MPICH homepage
–Open MPI homepage
–Message Passing Interface Forum (official MPI standards documents)
1.50 Reasons to use MPI
–Standardized, with a process to keep it evolving.
–Available on almost all parallel systems (free MPICH and Open MPI are used on many clusters), with interfaces for C and Fortran.
–Supplies many communication variations and optimized functions for a wide range of needs.
–Supports large program development and integration of multiple modules.
–Many powerful packages and tools are based on MPI.
Flynn's Classifications
Flynn (1966) created a classification for computers based upon instruction streams and data streams: SISD, SIMD, MISD, and MIMD.
1.52 SISD
Single instruction stream-single data stream (SISD) computer
Single processor computer: a single stream of instructions is generated from the program. Instructions operate upon a single stream of data items.
Single Instruction Stream-Multiple Data Stream (SIMD) Computer
A specially designed computer: a single instruction stream from a single program, but multiple data streams exist. Instructions from the program are broadcast to more than one processor. Each processor executes the same instruction in synchronism, but using different data.
Developed because a number of important applications mostly operate upon arrays of data.
Examples: array processors and GPUs.
1.54 SIMD
Multiple Instruction Stream-Single Data Stream (MISD)
Multiple instructions operate on a single data stream.
An uncommon architecture, generally used for fault tolerance.
Heterogeneous systems operate on the same data stream and must agree on the result.
Examples include the space shuttle flight control computer.
Multiple Instruction Stream-Multiple Data Stream (MIMD) Computer
General-purpose multiprocessor system: each processor has a separate program, and one instruction stream is generated from each program for each processor. Each instruction operates upon different data.
Both the shared memory and the message-passing multiprocessors so far described are in the MIMD classification.
Interconnection Networks
Provide a physical path for messages sent from one computer to another.
Key properties: bandwidth, latency, diameter, and cost.
Exhaustive interconnections (practical only for small systems: c(c-1)/2 links for c computers).
Restricted direct connections:
–2- and 3-dimensional meshes
–Hypercube (not now common)
As an alternative to direct links between computers, switches can route messages between computers:
–Crossbar
–Trees
–Multistage interconnection networks
Two-dimensional array (mesh)
Each node connects to its four nearest neighbors. Mesh and torus are popular due to ease of layout and expandability.
Also three-dimensional: used in some large high performance systems.
[Figure: grid of computers/processors joined by links.]
Three-dimensional hypercube
Each node connects to one neighbor in each dimension of the network.
The advantage is that the diameter of the network is $\log_2 p$ for a $p$-node network, which gives efficient communication.
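
The $\log_2 p$ diameter can be checked directly for a small hypercube. The sketch assumes the usual binary labeling, in which each node has a d-bit address and its neighbor in dimension k is the node whose address differs only in bit k. The function names are made up for this illustration.

```python
def hypercube_neighbors(node, dims):
    """Neighbors of a node: flip one address bit per dimension."""
    return [node ^ (1 << k) for k in range(dims)]


def hypercube_distance(a, b):
    """Minimum number of links between two nodes: the number of differing bits."""
    return bin(a ^ b).count("1")


dims = 3
p = 2 ** dims          # 8 nodes in a three-dimensional hypercube

# The diameter is the largest minimum distance over all pairs of nodes.
diameter = max(hypercube_distance(a, b) for a in range(p) for b in range(p))

print(hypercube_neighbors(0b000, dims))
print(diameter)
```

Each node has exactly `dims` links, and the farthest pair of nodes (for example 000 and 111) is `dims` links apart, which equals $\log_2 p$.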
Four-dimensional hypercube
Hypercubes were popular in the 1980s, but not now.
Crossbar switch
Exhaustive connection using one switch for each connection. Employed more in shared memory systems than in message-passing systems.
[Figure: processors and memories connected through a grid of switches.]
Tree
[Figure: switch elements arranged as a tree, with the root at the top, links between levels, and processors at the leaves.]
Binary tree: each switch has two links to the two switches below it.
Tree height is logarithmic: $\log_2 p$ levels with $p$ leaves.
The root can become a bottleneck, since traffic increases toward the root under uniform requests.
One remedy is to add more links toward the top, as in the fat binary tree.
Multistage Interconnection Network
Example: Omega network
[Figure: inputs on the left and outputs on the right, joined by several levels of 2 × 2 switch elements (straight-through or crossover connections).]
A number of levels of switches.
At each level, one bit of the destination address controls the switch, starting from the most significant bit: 0 selects the upper output and 1 selects the lower output.
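
Destination-bit routing can be traced in a few lines. The sketch assumes the standard Omega wiring, in which a perfect shuffle (a left rotation of the line address) precedes each level of 2 × 2 switches. The function name `omega_route` and the 8-input size are chosen only for illustration.

```python
def omega_route(src, dst, n_bits):
    """Trace a message through an Omega network with 2**n_bits inputs.

    Returns one (level, output, line) tuple per level, where output is
    "upper" or "lower" and line is the line address after that level.
    """
    mask = (1 << n_bits) - 1
    line = src
    path = []
    for level in range(n_bits):
        # Perfect shuffle: rotate the line address left by one bit.
        line = ((line << 1) | (line >> (n_bits - 1))) & mask
        # The destination bit for this level, most significant bit first.
        bit = (dst >> (n_bits - 1 - level)) & 1
        # 0 selects the upper switch output, 1 selects the lower output.
        line = (line & ~1) | bit
        path.append((level, "lower" if bit else "upper", line))
    return path


for level, output, line in omega_route(src=2, dst=5, n_bits=3):
    print(f"level {level}: take {output} output, now on line {line:03b}")
```

After the last level the line address equals the destination, whatever the source, because each level shifts in one more destination bit.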
References
Quentin F. Stout and Christiane Jablonowski (University of Michigan), "Parallel Computing 101", Supercomputing 2009 tutorial.