Classification of parallel computers
Limitations of parallel processing
Classification of parallel computers
In 1966, M. J. Flynn proposed an informal classification of computer parallelism based on the number of simultaneous instruction and data streams that can be distinguished during the operation of a computer system.
Classification of parallel computers
SISD – Single Instruction Stream, Single Data Stream
Conventional (von Neumann) architectures; vector computers?
[Diagram: CU sends the IS to the PU, which exchanges the DS with the MM]
PU – Processing Unit, CU – Control Unit, MM – Memory Module, IS – Instruction Stream, DS – Data Stream
Classification of parallel computers
SIMD – Single Instruction Stream, Multiple Data Streams
Vector computers? Array computers
[Diagram: one CU broadcasts a single IS to processing elements PE 1..PE n; each PE exchanges its own data stream DS 1..DS n with memory modules MM 1..MM m]
Classification of parallel computers
MISD – Multiple Instruction Streams, Single Data Stream
Nonexistent? Systolic arrays, pipelining?
[Diagram: CU 1..CU n issue IS 1..IS n to PU 1..PU n, which pass a single DS through memory modules MM 1..MM m]
Classification of parallel computers
MIMD – Multiple Instruction Streams, Multiple Data Streams
Multiprocessor systems, multicomputer systems
[Diagram: CU 1..CU n issue IS 1..IS n to PE 1..PE n, each exchanging its own DS 1..DS n with memory modules MM 1..MM m]
Classification of parallel computers
M. J. Flynn's taxonomy
Advantages of the classification: simplicity
Disadvantages: it does not cover all solutions and classes; MISD is an empty class; MIMD is an overloaded class
Classification of parallel computers according to sources of parallelism
Data-level parallelism – the same operation performed on multiple data units
Instruction-level parallelism (low-level parallelism):
– instruction pipelining
– multiple processor functional units
– data-flow analysis / out-of-order execution / branch prediction
Process/thread-level parallelism (high-level parallelism)
Classification of parallel computers according to sources of parallelism
[Diagram: execution timelines contrasting Data-Level Parallelism (DLP), Instruction-Level Parallelism (ILP), and Thread-Level Parallelism (TLP)]
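Two of these sources can be sketched directly in code; ILP, by contrast, is exploited by the hardware and the compiler rather than expressed by the programmer. A minimal C sketch, assuming an OpenMP-capable compiler (build with -fopenmp; the array size and the two printed tasks are illustrative only):

```c
#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];   /* zero-initialized globals */

int main(void) {
    /* Data-level parallelism: the same operation (an addition)
       applied to many data units; a vectorizing compiler may also
       turn this loop into SIMD instructions. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* Thread-level parallelism: independent, coarser tasks run
       concurrently in separate threads. */
    #pragma omp parallel sections
    {
        #pragma omp section
        { printf("task 1: first element %g\n", c[0]); }
        #pragma omp section
        { printf("task 2: last element %g\n", c[N - 1]); }
    }
    return 0;
}
```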
First mechanisms of parallel processing
Evolution of I/O functions: interrupts, DMA, I/O processors
Development of memory: virtual memory, cache memory
Multiplied ALUs (IBM 360, CDC 6600)
First mechanisms of parallel processing
Pipelining:
– pipelined control unit (instruction pipelining)
– pipelined arithmetic-logic unit (arithmetic pipelining)
Limitations of parallel processing
How much faster will my program run on a parallel computer than on a machine without any parallel processing mechanisms (a uniprocessor)?
Time and processor complexity
Given an algorithm and input data of size n:
Time complexity T(n) is the number of time steps needed to execute the algorithm for the given input of size n.
Processor complexity P(n) is the number of processors used in the execution of the algorithm for the given input of size n.
T(p, n) is the number of time steps needed to execute the algorithm on p processors for the given input of size n.
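A worked instance of these definitions, under the illustrative assumption that the algorithm sums n = 2^k numbers by a tree reduction:

```latex
% Illustrative example: summing n = 2^k numbers by a tree reduction.
\begin{align*}
  T(1, n) &= n - 1      && \text{sequential: one addition per remaining element}\\
  P(n)    &= n / 2      && \text{one processor per pair of elements}\\
  T(p, n) &= \log_2 n   && \text{one tree level per time step, with } p = n/2
\end{align*}
```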
Speedup
Speedup S(p, n) gives the factor of acceleration obtained by going from the sequential execution of an algorithm on one processor to the parallel execution of a parallel algorithm on p processors:
S(p, n) = T*(1, n) / T(p, n)
Speedup
T*(1, n) is the execution time of the best known sequential algorithm.
Typically 1 <= S(p, n) <= p
S(p, n) = p – ideal (linear) speedup
S(p, n) > p – superlinear speedup
[Diagram: summing A[0..7]. On one processor, P0 computes S = A[0]; S += A[1]; …; S += A[7] in seven time steps. On eight processors P0..P7, partial sums are combined pairwise in a tree, taking log2 8 = 3 addition steps.]
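The computation from the figure can be sketched in C with OpenMP (a sketch under assumptions: N, the array contents, and a compiler supporting -fopenmp are illustrative; the reduction clause lets the runtime combine partial sums in a tree, as in the figure):

```c
#include <stdio.h>

#define N 8

int main(void) {
    double A[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double S = 0.0;

    /* Sequentially this is S = A[0]; S += A[1]; ... (N-1 additions).
       In parallel, each thread sums a part of A and the runtime
       combines the partial sums, mirroring the reduction tree. */
    #pragma omp parallel for reduction(+:S)
    for (int i = 0; i < N; i++)
        S += A[i];

    printf("S = %g\n", S);
    return 0;
}
```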
Cost
Cost is the execution time multiplied by the number of processors: C(p, n) = p · T(p, n).
A cost-optimal algorithm is one for which the cost of solving a problem on a parallel system is proportional to the cost on a single processor.
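Applying this to the tree-sum figure (illustrative numbers, assuming one addition per time step):

```latex
% Cost of the tree sum from the figure (illustrative numbers).
\begin{align*}
  C(p, n) &= p \cdot T(p, n)\\
  C(8, 8) &= 8 \cdot \log_2 8 = 24 && \text{vs. } T^*(1, 8) = 7 \text{ additions,}
\end{align*}
% so with p = n processors the tree sum is not cost-optimal;
% reducing to p = n / \log_2 n processors brings the cost to O(n).
```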
Efficiency
Efficiency relates the speedup to the number of processors: E(p, n) = S(p, n) / p.
Efficiency represents the fraction of time for which a processor does useful work.
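Continuing the same illustrative example (p = 8 processors, counting only addition steps):

```latex
% Efficiency of the 8-element tree sum (illustrative numbers).
\begin{align*}
  E(p, n) &= \frac{S(p, n)}{p}\\
  E(8, 8) &= \frac{7/3}{8} \approx 0.29 && \text{useful work for about 29\% of the time}
\end{align*}
```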
Amdahl’s Law
Assuming a constant problem size, what is the maximum speedup that can be obtained through parallel processing?
Amdahl’s Law
[Diagram: unit-length bars split into a sequential part f and a perfectly parallelizable part 1-f; for p = 1, 2, 4 the parallel part shrinks to (1-f)/2 and (1-f)/4 while f stays constant]
Amdahl’s Law
S(p) = 1 / (f + (1-f)/p)
where f is the sequential fraction of the computation; as p → ∞, S(p) → 1/f.
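A minimal numeric sketch of the law (the values of f and p below are assumptions chosen for illustration):

```c
/* Amdahl's Law: fixed problem size, sequential fraction f. */
#include <stdio.h>

static double amdahl(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    double f = 0.1;                     /* 10% strictly sequential */
    int ps[] = {1, 2, 4, 8, 1024};
    for (int i = 0; i < (int)(sizeof ps / sizeof ps[0]); i++)
        printf("p = %4d  S = %.2f\n", ps[i], amdahl(f, ps[i]));
    /* As p grows, S approaches 1/f = 10: the sequential part
       bounds the achievable speedup. */
    return 0;
}
```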
Gustafson’s Law
Is the reduced execution time obtained thanks to parallel processing always the highest priority?
What about the increased amount of work that can be performed, thanks to parallel processing, within the same period of time?
Gustafson’s Law
[Diagram: bars split into a sequential part f and a perfectly parallelizable part 1-f; for p = 1, 2, 4 the amount of parallel work grows with p while the execution time stays fixed]
Gustafson’s Law
S(p) = f + (1-f)·p = p - f·(p-1)
where f is the sequential fraction of the scaled (fixed-time) workload; the speedup grows linearly with p.
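The same sketch adapted to Gustafson’s scaled speedup (again with illustrative f and p):

```c
/* Gustafson's Law: the parallel part of the workload grows with p;
   f is the sequential fraction of the scaled run. */
#include <stdio.h>

static double gustafson(double f, int p) {
    return f + (1.0 - f) * p;           /* = p - f * (p - 1) */
}

int main(void) {
    double f = 0.1;
    int ps[] = {1, 2, 4, 8, 1024};
    for (int i = 0; i < (int)(sizeof ps / sizeof ps[0]); i++)
        printf("p = %4d  S = %.1f\n", ps[i], gustafson(f, ps[i]));
    /* Unlike Amdahl's bound, the scaled speedup grows linearly in p. */
    return 0;
}
```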
Grain of parallelism
Fine-grained parallelism – problems decomposed into many small tasks that communicate or synchronize frequently.
Coarse-grained parallelism – problems decomposed into fewer, larger tasks that communicate infrequently.
Loosely & tightly coupled design approach
The loosely vs. tightly coupled design approach describes the degree of internal coupling between the components of a computer system.
This degree of coupling corresponds to the overhead associated with communication, as well as to the potential to scale a given design up.