FIT5174 Distributed & Parallel Systems

1 FIT5174 Distributed & Parallel Systems
Lecture 7: Parallel Computer System Architectures
Dr. Ronald Pose

2 Acknowledgement
These slides are based on slides and material by Carlo Kopp.

3 Parallel Computing
Parallel computing is a form of computation in which many instructions are carried out simultaneously. It operates on the principle that large problems can often be divided into smaller ones, which are then solved concurrently (i.e. at the same time). There are several different forms of parallel computing: bit-level parallelism, instruction-level parallelism, data parallelism, and task parallelism.
(Diagram: serial computing vs. parallel computing.)

4 Parallel Computing
Contemporary computer applications require the processing of large amounts of data in sophisticated ways. Examples include:
- parallel databases, data mining
- oil exploration
- web search engines, web-based business services
- computer-aided diagnosis in medicine
- management of national and multi-national corporations
- advanced graphics and virtual reality, particularly in the entertainment industry
- networked video and multi-media technologies
- collaborative work environments
Ultimately, parallel computing is an attempt to minimise the time required to compute a problem, despite the performance limitations of individual CPUs / cores.

5 Parallel Computing Terminology
There are different ways to classify parallel computers. One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy. Flynn's taxonomy distinguishes multi-processor computer architectures according to two independent dimensions: Instruction and Data. Each of these dimensions can have only one of two possible states: Single or Multiple. The four possible classifications according to Flynn are:
- SISD: Single Instruction, Single Data
- SIMD: Single Instruction, Multiple Data
- MISD: Multiple Instruction, Single Data
- MIMD: Multiple Instruction, Multiple Data

6 Concepts and Terminology
At the executable machine code level, programs are seen by the processor or core as a series of machine instructions in some machine-specific binary code. The common format of any instruction is an "operation code" or "opcode" plus some "operands", which are arguments the processor/core can understand. Typically, operands are held in registers in the processor/core, which store several bytes of data, or are memory addresses pointing to locations in the machine's main memory. In a "conventional" or "general-purpose" processor/core a single instruction combines one opcode with two or three operands, e.g.
ADD R1, R2, R3 - add the contents of R1 and R2, and put the result into R3.
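As a rough illustration of the opcode-plus-operands idea, here is a toy register-machine sketch in Python (purely hypothetical, not any real instruction set; the register names follow the slide's ADD example):

    # Toy model: a register file plus a three-operand instruction interpreter.
    registers = {"R1": 5, "R2": 7, "R3": 0}

    def execute(opcode, src1, src2, dst):
        # e.g. ADD R1, R2, R3 - add contents of R1 and R2, put result into R3
        if opcode == "ADD":
            registers[dst] = registers[src1] + registers[src2]

    execute("ADD", "R1", "R2", "R3")
    print(registers["R3"])   # 12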

7 Flynn's Classification

8 Flynn's Classification - SISD
Single Instruction, Single Data (SISD): a serial (non-parallel or "conventional") computer.
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
- Single data: only one data stream is being used as input during any one clock cycle.
- Deterministic execution.
This is the oldest and, until recently, the most prevalent form of computer. Examples: most PCs, single-CPU workstations and mainframes.

9 Flynn's Classification - SIMD
Single Instruction, Multiple Data (SIMD): a type of parallel computer.
- Single instruction: all processing units execute the same instruction at any given clock cycle.
- Multiple data: each processing unit can operate on a different data element.
This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity processing units. It is best suited to specialised problems characterised by a high degree of regularity, such as image processing and matrix algebra. Execution is synchronous (lockstep) and deterministic. There are two varieties: Processor Arrays and Vector Pipelines.
Examples:
- Processor Arrays: Connection Machine CM-2, MasPar MP-1, MP-2
- Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820
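As a loose illustration of the SIMD idea (assuming NumPy is available; whether real SIMD instructions are issued underneath depends on the CPU and the library build), a single conceptual operation is applied to many data elements at once:

    import numpy as np

    # One operation ("multiply by 2") applied across a whole array of data
    # elements, instead of looping over them one at a time.
    data = np.arange(8, dtype=np.float64)
    result = 2.0 * data
    print(result)   # [ 0.  2.  4.  6.  8. 10. 12. 14.]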

10 Flynn's Classification - SIMD

11 Flynn's Classification - MISD
Multiple Instruction, Single Data (MISD): a single data stream is fed into multiple processing units, and each processing unit operates on the data independently via its own instruction stream. Few actual examples of this class of parallel computer have ever existed; one was the experimental Carnegie-Mellon C.mmp computer. Some conceivable uses might be:
- multiple frequency filters operating on a single signal stream
- multiple cryptography algorithms attempting to crack a single coded message

12 Flynn's Classification - MIMD
Multiple Instruction, Multiple Data (MIMD): currently the most common type of parallel computer; most modern computers fall into this category.
- Multiple instruction: every processor may be executing a different instruction stream.
- Multiple data: every processor may be working with a different data stream.
Execution can be synchronous or asynchronous, deterministic or non-deterministic. Examples: most current supercomputers, networked parallel computer "grids", and multi-processor SMP computers, including some types of PCs.

13 Parallel Computer Memory Architectures
Parallel computer memory architectures are broadly divided into three categories: shared memory, distributed memory, and hybrid.
Shared Memory: Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as a global address space. Multiple processors can operate independently but share the same memory resources. Changes in a memory location effected by one processor are visible to all other processors. Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA, i.e. the Uniform Memory Access and Non-Uniform Memory Access models.
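A minimal sketch of the shared-memory idea using Python threads (an illustration, not part of the lecture; it demonstrates a shared address space rather than parallel speed-up):

    import threading

    # All threads see the same 'counter' variable, so an update made by one
    # thread is visible to the others; a lock keeps the shared update consistent.
    counter = 0
    lock = threading.Lock()

    def work(n):
        global counter
        for _ in range(n):
            with lock:
                counter += 1

    threads = [threading.Thread(target=work, args=(10000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)   # 40000 - every thread updated the same memory location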

14 Parallel Computer - Shared Memory

15 Parallel Computer - Distributed Memory
Distributed Memory: Distributed memory systems require a communication network to connect inter-processor memory.
- Processors have their own local memory. There is no concept of a global address space across all processors.
- Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of "cache coherency" does not apply.
- When a processor needs access to data held by another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronisation between tasks is likewise the programmer's responsibility.
- The network "fabric" used for data transfers varies widely; it can be as simple as Ethernet, or as complex as a specialised bus or switching device.
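A minimal sketch of the distributed-memory model using two Python processes (illustrative only): each process has its own address space, and the only way data moves between them is an explicit send and receive, which is the programmer's responsibility.

    from multiprocessing import Process, Pipe

    def worker(conn):
        local = conn.recv()      # data arrives only via an explicit message
        conn.send(local * 2)     # results go back the same way
        conn.close()

    if __name__ == "__main__":
        parent_end, child_end = Pipe()
        p = Process(target=worker, args=(child_end,))
        p.start()
        parent_end.send(21)      # explicit communication, not shared memory
        print(parent_end.recv()) # 42
        p.join()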

16 Parallel Computer - Distributed Memory

17 Parallel Computer - Hybrid Memory
Hybrid: The largest and fastest computers in the world today employ both shared and distributed memory architectures.
- The shared memory component is usually a cache-coherent SMP machine. Processors on a given SMP can address that machine's memory as global.
- The distributed memory component is the networking of multiple SMPs. SMPs know only about their own memory, not the memory on another SMP. Therefore, network communications are required to move data from one SMP to another.
Current trends seem to indicate that this type of memory architecture will continue to prevail and increase at the high end of computing for the foreseeable future. Advantages and disadvantages: whatever is common to both shared and distributed memory architectures.

18 Parallel Computer - Hybrid Memory

19 Parallel Programming Models
Overview: There are several parallel programming models in common use:
- Shared Memory
- Threads
- Message Passing
- Data Parallel
- Hybrid
Parallel programming models exist as an abstraction above hardware and memory architectures. Although it might not seem apparent, these models are NOT specific to a particular type of machine or memory architecture. In fact, any of these models can (theoretically) be implemented on any underlying hardware.

20 Parallel Computing Performance
General speed-up formula: execution time is made up of three components:
- Inherently sequential computations: σ(n)
- Potentially parallel computations: φ(n)
- Communication operations: κ(n,p)
where n is the problem size and p is the number of processors.

21 Speed-up Formula
Speed-up is the ratio of sequential to parallel execution time:
ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))
(Graph: as p grows, the computation term φ(n)/p falls while the communication term κ(n,p) rises, so their sum φ(n)/p + κ(n,p) has a minimum and the speed-up a corresponding maximum.)

22 Amdahl's Law of Speed-up
Amdahl's law states that even a small portion of the program which cannot be parallelized will limit the overall speed-up available from parallelization. Any large mathematical or engineering problem will typically consist of several parallelizable parts and several non-parallelizable (sequential) parts. This relationship is given by the equation:
S = 1 / (1 - P)
where S is the maximum speed-up of the program (as a factor of its original sequential runtime), and P is the fraction that is parallelizable. With a finite number of processors N, the speed-up is S(N) = 1 / ((1 - P) + P/N).

23 Interesting Amdahl Observation
If the sequential portion of a program is 10% of the runtime, we can get no more than a 10x speed-up (1/0.10 = 10), regardless of how many processors are added. This puts an upper limit on the usefulness of adding more parallel execution units.

24 Amdahl's Law

25 Parallel Efficiency
Efficiency is speed-up per processor: ε(n,p) = ψ(n,p) / p, with 0 ≤ ε(n,p) ≤ 1.
Amdahl's law: let f = σ(n) / (σ(n) + φ(n)); i.e., f is the fraction of the code which is inherently sequential. Then the speed-up on p processors is bounded by
ψ ≤ 1 / (f + (1 - f)/p)

26 Examples
- 95% of a program's execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?
- 20% of a program's execution time is spent within inherently sequential code. What is the limit to the speedup achievable by a parallel version of the program?
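A worked sketch of both questions using Amdahl's formula S(N) = 1 / ((1 - P) + P/N), where P is the parallel fraction (the numbers below follow directly from the questions):

    def amdahl(parallel_fraction, processors):
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)

    # 95% parallel on 8 CPUs: about 5.9x
    print(round(amdahl(0.95, 8), 2))

    # 20% inherently sequential: the limit as processors grow is 1/0.2 = 5x
    print(round(1.0 / 0.2, 1))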

27 Amdahl's Law Limitations
Limitations of Amdahl's Law:
- It ignores κ(n,p), so it overestimates the speedup.
- It assumes f is constant, so it underestimates the speedup achievable as the problem size grows.
Amdahl Effect:
- Typically κ(n,p) has lower complexity than φ(n)/p.
- As n increases, φ(n)/p dominates κ(n,p).
- As n increases, speedup increases.
- As n increases, the sequential fraction f decreases.
(Graph: speedup vs. number of processors for n = 100, n = 1,000 and n = 10,000; larger problem sizes give higher speedup.)

28 Gustafson's Law
Gustafson's Law (also known as the Gustafson-Barsis law, 1988) states that any sufficiently large problem can be efficiently parallelized. It is closely related to Amdahl's law, which gives a limit to the degree to which a program can be sped up by parallelization:
S(P) = P - α * (P - 1)
where P is the number of processors, S is the speedup, and α the non-parallelizable fraction of the process. Gustafson's law addresses a shortcoming of Amdahl's law, which cannot scale to match the availability of computing power as the machine size increases.
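A small illustrative calculation of Gustafson's scaled speed-up (the values of P and α below are made up for the example, not taken from the lecture):

    def gustafson(P, alpha):
        # S(P) = P - alpha * (P - 1)
        return P - alpha * (P - 1)

    print(gustafson(64, 0.05))   # 60.85: near-linear scaling when the serial part is small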

29 Gustafson's Law
Gustafson's law also removes the assumption of a fixed problem size or fixed computation load on the parallel processors: instead, it proposes a fixed-time concept which leads to scaled speed-up. Amdahl's law, by contrast, is based on a fixed workload or fixed problem size. It implies that the sequential part of a program does not change with respect to machine size (i.e. the number of processors), while the parallel part is evenly distributed over the processors.

30 Performance Summary
Performance terms: speedup, efficiency.
What prevents linear speedup?
- Serial operations
- Communication operations
- Process start-up
- Imbalanced workloads
- Architectural limitations
Analyzing parallel performance: Amdahl's Law, Gustafson-Barsis' Law.

31 Parallel Programming Examples
This example demonstrates calculations on 2-dimensional array elements, with the computation on each array element being independent of the other array elements. The serial program calculates one element at a time in sequential order. Serial code could be of the form:
do j = 1, n
  do i = 1, m
    a(i,j) = fcn(i,j)
  end do
end do
The calculation of the elements is independent of one another, which leads to an embarrassingly parallel situation. The problem should be computationally intensive.
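A runnable Python equivalent of the serial loop above (the array shape and fcn are placeholders chosen for illustration):

    m, n = 4, 6

    def fcn(i, j):
        return i * j                 # any independent per-element computation

    a = [[0] * n for _ in range(m)]
    for j in range(n):               # outer loop over columns, as in the Fortran-style code
        for i in range(m):
            a[i][j] = fcn(i, j)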

32 Parallel Programming - 2D Example
Array elements are distributed so that each processor owns a portion of the array (a subarray).
- Independent calculation of the array elements ensures there is no need for communication between tasks.
- The distribution scheme is chosen by other criteria, e.g. unit stride (stride of 1) through the subarrays. Unit stride maximizes cache/memory usage.
- After the array is distributed, each task executes the portion of the loop corresponding to the data it owns. For example:
do j = mystart, myend
  do i = 1, m
    a(i,j) = fcn(i,j)
  end do
end do
Notice that only the outer loop variables differ from the serial solution.
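A minimal sketch of the column-block distribution described above, using a Python process pool instead of the explicit master/worker messages in the lecture's pseudo-code (an assumption made for brevity). Each task fills only the columns it owns, so no communication is needed during the computation itself:

    from multiprocessing import Pool

    m, n, ntasks = 4, 8, 4

    def fcn(i, j):
        return i * j

    def compute_columns(col_range):
        mystart, myend = col_range
        # each task computes its own contiguous block of columns
        return [[fcn(i, j) for j in range(mystart, myend)] for i in range(m)]

    if __name__ == "__main__":
        chunk = n // ntasks
        ranges = [(t * chunk, (t + 1) * chunk) for t in range(ntasks)]
        with Pool(ntasks) as pool:
            blocks = pool.map(compute_columns, ranges)   # gather the sub-arrays
        print(len(blocks), "column blocks computed")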

33 Pseudo-code
find out if I am MASTER or WORKER
if I am MASTER
  initialize the array
  send each WORKER info on the part of the array it owns
  send each WORKER its portion of the initial array
  receive from each WORKER results
else if I am WORKER
  receive from MASTER info on the part of the array I own
  receive from MASTER my portion of the initial array
  # calculate my portion of the array
  do j = my_first_column, my_last_column
    do i = 1, n
      a(i,j) = fcn(i,j)
    end do
  end do
  send MASTER results
endif

34 Pi Calculation: Serial Solution
The value of PI can be calculated in a number of ways. Consider the following method of approximating PI:
- Inscribe a circle in a square.
- Randomly generate points in the square.
- Determine the number of points in the square that are also in the circle.
- Let r be the number of points in the circle divided by the number of points in the square.
- Then PI ~ 4r.
Note that the more points generated, the better the approximation.

35 Pi Calculation: Serial Solution
Serial pseudo code for this procedure:
npoints = 10000
circle_count = 0
do j = 1, npoints
  generate 2 random numbers between 0 and 1
  xcoordinate = random1 ; ycoordinate = random2
  if (xcoordinate, ycoordinate) inside circle then
    circle_count = circle_count + 1
end do
PI = 4.0 * circle_count / npoints
Note that most of the time in running this program would be spent executing the loop. This leads to an embarrassingly parallel solution:
- computationally intensive
- minimal communication
- minimal I/O
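A runnable Python version of the serial pseudo code above (it samples the unit square and tests membership of the quarter circle, which gives the same PI/4 ratio):

    import random

    npoints = 10000
    circle_count = 0
    for _ in range(npoints):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:         # point falls inside the circle
            circle_count += 1
    print(4.0 * circle_count / npoints)  # approximately 3.14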

36 Pi Calculation: Parallel Solution
Parallel strategy: break the loop into portions that can be executed by the tasks. For the task of approximating PI:
- Each task executes its portion of the loop a number of times.
- Each task can do its work without requiring any information from the other tasks (there are no data dependencies).
- This uses the SPMD** model. One task acts as master and collects the results.
The pseudo-code solution is on the next slide (in the original slides, red highlighted the changes needed for parallelism).
[**SPMD: (Single Process, Multiple Data) or (Single Program, Multiple Data). Tasks are split up and run simultaneously on multiple processors with different input in order to obtain results faster. SPMD is the most common style of parallel programming. It is a subcategory of MIMD in Flynn's Taxonomy.]

37 Pi Calculation: Parallel Solution Pseudocode
npoints = 10000
circle_count = 0
p = number of tasks
num = npoints/p
find out if I am MASTER or WORKER
do j = 1, num
  generate 2 random numbers between 0 and 1
  xcoordinate = random1 ; ycoordinate = random2
  if (xcoordinate, ycoordinate) inside circle then
    circle_count = circle_count + 1
end do
if I am MASTER
  receive from WORKERS their circle_counts
  compute PI (use MASTER and WORKER calculations)
else if I am WORKER
  send to MASTER circle_count
endif
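A sketch of the same SPMD strategy in Python, using a process pool rather than explicit MASTER/WORKER messages (an assumption made to keep the example short; the main process plays the master's role of combining the partial counts):

    import random
    from multiprocessing import Pool

    def count_in_circle(num):
        count = 0
        for _ in range(num):
            x, y = random.random(), random.random()
            if x * x + y * y <= 1.0:
                count += 1
        return count

    if __name__ == "__main__":
        npoints, p = 100000, 4
        with Pool(p) as pool:                       # p worker tasks
            counts = pool.map(count_in_circle, [npoints // p] * p)
        print(4.0 * sum(counts) / npoints)          # combined estimate of PI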

38 1-D Wave Equation Parallel Solution
Implement as an SPMD model:
- The entire amplitude array is partitioned and distributed as sub-arrays to all tasks. Each task owns a portion of the total array.
- Load balancing: all points require equal work, so the points should be divided equally.
- A block decomposition would partition the work into as many chunks as there are tasks, allowing each task to own mostly contiguous data points.

39 1-D Wave Equation Parallel Solution
Communication need only occur on data borders; the larger the block size, the less the communication. The equation to be solved is the one-dimensional wave equation:
A(i, t+1) = (2.0 * A(i, t)) - A(i, t-1) + (c * (A(i-1, t) - (2.0 * A(i, t)) + A(i+1, t)))
where c is a constant. Note that the amplitude depends on the previous timesteps (t, t-1) and on neighboring points (i-1, i+1). This data dependence means that a parallel solution will involve communication.
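A small serial sketch of one time step of this update rule in Python, showing the dependence on the two previous time levels and on neighbouring points (the array size, the constant c and the initial disturbance are illustrative values):

    npoints, c = 10, 0.1
    oldval = [0.0] * npoints
    values = [0.0] * npoints
    values[npoints // 2] = 1.0            # an initial disturbance in the middle

    newval = values[:]
    for i in range(1, npoints - 1):       # interior points; boundaries handled separately
        newval[i] = (2.0 * values[i] - oldval[i]
                     + c * (values[i - 1] - 2.0 * values[i] + values[i + 1]))
    oldval, values = values, newval       # advance to the next time step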

40 1-D Wave Equation Parallel Solution
find out number of tasks and task identities
# Identify left and right neighbors
left_neighbor = mytaskid - 1 ; right_neighbor = mytaskid + 1
if mytaskid = first then left_neighbor = last
if mytaskid = last then right_neighbor = first
find out if I am MASTER or WORKER
if I am MASTER
  initialize array ; send each WORKER starting info and subarray
else if I am WORKER
  receive starting info and subarray from MASTER
endif
# Update values for each point along string
# In this example the master participates in calculations
do t = 1, nsteps
  send left endpoint to left neighbor ; receive left endpoint from right neighbor
  send right endpoint to right neighbor ; receive right endpoint from left neighbor
  # Update points along line
  do i = 1, npoints
    newval(i) = (2.0 * values(i)) - oldval(i) + (sqtau * (values(i-1) - (2.0 * values(i)) + values(i+1)))
  end do
end do
# Collect results and write to file
if I am MASTER
  receive results from each WORKER
  write results to file
else if I am WORKER
  send results to MASTER
endif

