Parallel Processing: Fundamental Concepts
Selection of an Application for Parallelization
Parallel computation can be used for two things:
– Speed up an existing application
– Improve the quality of the result for an application: more compute power can allow a revolutionary change in the algorithm
The application should be compute intensive. Unless a significant speedup is achievable, parallelization is not worth the effort.
Fundamental Limits: Amdahl’s Law
T1 = execution time using 1 processor (serial execution time)
Tp = execution time using P processors
S = serial fraction of the computation (i.e. the fraction that can only be executed using 1 processor)
C = fraction of the computation that could be executed by P processors
Then S + C = 1 and
Tp = S * T1 + (C * T1)/P = (S + C/P) * T1
Speedup = Ψ(P) = T1/Tp = 1/(S + C/P)
Fundamental Limits: Amdahl’s Law (cont.)
Speedup Ψ(P) = T1/Tp = 1/(S + C/P)
Maximum speedup (using an infinite number of processors): Smax = 1/S
Example: S = 0.05 gives Smax = 20
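The formula above can be checked numerically. The following is a minimal sketch; the function name amdahl_speedup and the sample processor counts are illustrative choices, not part of the original slides.

```python
import math


def amdahl_speedup(serial_fraction, processors):
    """Speedup predicted by Amdahl's law: 1 / (S + C/P), with C = 1 - S."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    parallel_fraction = 1.0 - serial_fraction
    if math.isinf(processors):
        # Limit P -> infinity: the C/P term vanishes, leaving 1/S.
        return math.inf if serial_fraction == 0 else 1.0 / serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Speedup for S = 0.05 as the processor count grows towards the 1/S limit.
for p in (1, 2, 8, 64, 1024, math.inf):
    print(p, round(amdahl_speedup(0.05, p), 2))
```

Even with 1024 processors the speedup stays below the bound 1/S, which is the point of the slide.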
Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Scalability of Multithreaded Applications: Amdahl’s Law
Speedup is limited by the amount of serial code:
Ψ(p) ≤ 1 / (s + (1 - s)/p), where 0 ≤ s ≤ 1 is the fraction of serial operations
[Figure: Maximum Theoretical Speedup from Amdahl's Law. Speedup versus number of cores, one curve each for %serial = 0, 10, 20, 30, 40, 50.]
Scalability of Multithreaded Applications: Question
If an application is only 25% serial, what is the maximum speedup you can ever achieve, assuming an infinite number of processors? (Ignore parallel overhead.)
A: 1.25   B: 2.0   C: 4.0   D: No speedup
Speedup Ψ(P) = T1/Tp = 1/(S + C/P)
Serial fractions appear in “non-obvious” ways.
Example application profile:
1. input: 10%
2. compute setup: 15%
3. computation: 75%
If only part 3 is parallelized: S = 0.25, so Smax = 4
If parts 2 and 3 are parallelized: S = 0.10, so Smax = 10
You have to live with Smax unless you can change the algorithm implemented in your application so that it makes better use of a parallel machine.
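The two bounds in the profile example follow directly from Smax = 1/S. A small sketch of that arithmetic is given here; the helper max_speedup and the dictionary layout are hypothetical, chosen only for illustration.

```python
def max_speedup(profile, parallelized):
    """Upper bound 1/S, where S is the share of run time that stays serial."""
    total = sum(profile.values())
    serial = sum(share for part, share in profile.items()
                 if part not in parallelized)
    return float("inf") if serial == 0 else total / serial


# Percentages of total run time, taken from the example profile.
profile = {"input": 10, "compute setup": 15, "computation": 75}

print(max_speedup(profile, {"computation"}))
print(max_speedup(profile, {"compute setup", "computation"}))
```

Parallelizing the comparatively small setup phase more than doubles the attainable bound, which is why serial fractions hidden outside the main computation matter.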
Speedups with Scaled Problem Sizes
Determine speedup given constant problem size
Determine speedup given constant turnaround time
– Assume perfect speedup and determine the problem size that can be computed in the same turnaround time
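The two ways of measuring speedup can be compared numerically. The slide does not name a formula for the constant-turnaround case; the sketch below assumes the usual scaled-speedup model (commonly known as Gustafson's law), in which the serial fraction s is measured on the parallel run and the parallel part scales perfectly. Both function names are illustrative.

```python
def fixed_size_speedup(s, p):
    """Constant problem size (Amdahl): 1 / (s + (1 - s)/p)."""
    return 1.0 / (s + (1.0 - s) / p)


def fixed_time_speedup(s, p):
    """Constant turnaround time (assumed Gustafson model): s + (1 - s) * p.

    The parallel run takes normalized time s + (1 - s) = 1; the same, larger
    problem would take s + (1 - s) * p on a single processor.
    """
    return s + (1.0 - s) * p


for p in (4, 20, 100):
    print(p, round(fixed_size_speedup(0.05, p), 2),
          round(fixed_time_speedup(0.05, p), 2))
```

Under this assumption the scaled speedup keeps growing with the number of processors, because the problem size grows with the machine.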
Type of Parallelization: User Controlled vs Automatic
User controlled:
– Programmer tells all processors what to do at all times
– More freedom, but significant effort from the programmer
– Several problems, for example:
   The programmer may not know the details of the machine as well as the compiler does
   Sophistication is needed to write programs with good locality and grain (i.e. the work size assigned to each processor)
Automatic: compilers
Approaches: Exact vs Inexact
Begin with sequential code.
Exact parallelism:
– Definition: all data dependences remain intact
– Advantage: the answer is guaranteed to be the same as that of the sequential implementation, independent of the number of processors
– Problem: unnecessary dependences cause inefficiency
Approaches: Exact vs Inexact (continued)
Inexact parallelism:
– Definition: “relax” data dependences, i.e. allow “stale” data to be used instead of the most up-to-date data
– Used in both numerical solution techniques and combinatorial optimization
– Reduces synchronization overhead
– Usually in the context of iterative algorithms, which still converge to the right answer
– May or may not be faster
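One familiar instance of relaxing dependences in a numerical method is the contrast between Gauss-Seidel and Jacobi iteration. This example is not from the slides; it is a minimal sketch on a small, diagonally dominant system chosen so that both methods converge. The matrix, right-hand side, and function names are all illustrative assumptions.

```python
# Solve A x = b; the exact solution of this system is x = [1, 2, 3].
A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [6.0, 12.0, 14.0]


def gauss_seidel_sweep(x):
    """Each update reads the newest values, so updates depend on each other."""
    x = list(x)
    for i in range(len(x)):
        s = sum(A[i][j] * x[j] for j in range(len(x)) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x


def jacobi_sweep(x):
    """Each update reads only the previous sweep's (stale) values,
    so all updates within a sweep are independent and could run in parallel."""
    n = len(x)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]


def sweeps_to_converge(sweep, tol=1e-10, limit=1000):
    """Iterate from x = 0 until successive sweeps differ by less than tol."""
    x = [0.0] * len(b)
    for count in range(1, limit + 1):
        new = sweep(x)
        if max(abs(u - v) for u, v in zip(new, x)) < tol:
            return count, new
        x = new
    raise RuntimeError("did not converge")


print(sweeps_to_converge(gauss_seidel_sweep)[0],
      sweeps_to_converge(jacobi_sweep)[0])
```

Both variants reach the same answer. The stale-data version typically needs more sweeps, but each sweep has no internal dependences, which matches the "may or may not be faster" caveat.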
Speculative Parallelism
Do more work than may need to be done.
Example: execute both sides of an IF statement in parallel, before the condition is known
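The IF-statement example can be sketched with a thread pool: the condition and both branches are started together, and only the result of the branch actually taken is kept. The three functions below are placeholders invented for illustration, and the approach assumes both branches are free of side effects.

```python
from concurrent.futures import ThreadPoolExecutor


def condition(x):
    """Placeholder for a condition that is expensive to evaluate."""
    return x % 2 == 0


def then_branch(x):
    """Placeholder for the work on the 'true' side of the IF."""
    return x // 2


def else_branch(x):
    """Placeholder for the work on the 'false' side of the IF."""
    return 3 * x + 1


def speculative_if(x):
    """Start the condition and both branches at once; discard the loser."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        cond = pool.submit(condition, x)
        taken = pool.submit(then_branch, x)
        not_taken = pool.submit(else_branch, x)
        return taken.result() if cond.result() else not_taken.result()


print(speculative_if(10), speculative_if(7))
```

One of the two branch results is always thrown away, which is the extra work the slide refers to.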
Orthogonal Parallelism
Think about parallelism like slicing an apple:
– Cut it into N slices along the X axis: N pieces
– Cut those into M slices along the Y axis: N x M pieces
– Cut those into K slices along the Z axis: N x M x K pieces
Example
Nested loop:
   for I = 1,10
      for J = 1,10
         for K = 1,5
            “independent work”
Orthogonal parallelism means creating 500 different threads, each executing one instance of the “independent work”.
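The nested-loop example can be flattened into a single index space of 10 x 10 x 5 = 500 independent tasks. The sketch below deviates from the slide in one labeled way: instead of creating 500 threads, it submits 500 tasks to a small thread pool (8 workers, an arbitrary choice). The body of independent_work is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def independent_work(i, j, k):
    """Placeholder body; any computation that depends only on (i, j, k)."""
    return i * 100 + j * 10 + k


# Flatten the three loop dimensions into one list of (I, J, K) index triples.
index_space = list(product(range(1, 11), range(1, 11), range(1, 6)))

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda ijk: independent_work(*ijk), index_space))

print(len(index_space), len(results))
```

Whether 500 threads or 500 pooled tasks is the better choice depends on the grain size of the work, which the following slides discuss.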
Design Tradeoffs in Parallel Decomposition
“Granularity” vs “Communication/Synchronization” vs “Load Balance”
– Granularity: amount of computation between interprocess communications
– Interprocess communication: data transmission or synchronization
– Load balance: distributing work evenly between processes (threads)
Granularity (continued)
Grain size:
– fine: program chopped into many small pieces
– coarse: fewer, larger pieces
Impact of the Choice of Granularity
Parallel decomposition overhead:
– As granularity decreases (finer grain), overhead increases
– e.g. the time each process takes to obtain a task (serialization if there is a single task queue)
Load balance:
– As granularity decreases, load balance improves
– Better distribution of work between processors
Graph
Typical graph of execution time using P processors (if grain size can be varied)
[Figure: execution time versus granularity, starting from fine grain. Overhead dominates at the fine end; load imbalance dominates at the other end.]
Execution Time
Execution time is NOT S + (C/P)
Execution time = S + (C/P)(1 + Kp) + Op
Kp: cost due to load imbalance and communication/synchronization
Op: other overhead
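The gap between the ideal and the more realistic model can be made concrete with a few numbers. In the sketch below the values chosen for Kp and Op are invented purely for illustration; in practice they would have to be measured.

```python
def ideal_time(s, c, p):
    """Idealized execution time: S + C/P."""
    return s + c / p


def modeled_time(s, c, p, k_p, o_p):
    """Execution time with overheads: S + (C/P)(1 + Kp) + Op."""
    return s + (c / p) * (1.0 + k_p) + o_p


s, c, p = 0.05, 0.95, 8
k_p, o_p = 0.20, 0.01   # assumed values, not taken from the slides

print(round(ideal_time(s, c, p), 4), round(modeled_time(s, c, p, k_p, o_p), 4))
```

With Kp = 0 and Op = 0 the two models coincide; any positive overhead pushes the achieved speedup below the Amdahl bound.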
Scheduling: Static vs Dynamic
If the grain size is constant and the number of tasks is known, tasks can be assigned to processors statically (e.g. at compile time)
– This reduces the overhead of assigning work to processors
If not, some dynamic scheduling mechanism is needed (e.g. task queue, self-scheduled loop)
It is even possible to decide dynamically whether or not to spawn (create an additional process)
Static Scheduling of Parallel Loops
One of the most popular constructs for shared-memory programming is the parallel loop. A parallel loop is a “for” or “do” statement, except that it doesn’t iterate.
– Instead, it says “just get all these things done, in any order, using several processors if possible.”
– The number of processors available to the job may be specified or limited
An example parallel loop
   c = sin(d)
   parallel do i=1 to 30
      a(i) = b(i) + c
   end parallel do
   e = a(20) + a(15)
Implementation of parallel loops using static scheduling
The example parallel loop can be implemented as follows:
   c = sin(d)
   start_task sub(a,b,c,1,10)
   start_task sub(a,b,c,11,20)
   call sub(a,b,c,21,30)
   wait_for_all_tasks_to_complete
   e = a(20) + a(15)
   ...
   subroutine sub(a,b,c,k,l)
   ...
   for i=k to l
      a(i) = b(i) + c
   end for
   end sub
Notice that, in this program, arrays a and b are shared by the three processors cooperating in the execution of the loop.
Implementation of parallel loops using static scheduling (continued)
This program assigns to each processor a fixed segment of the iteration space. This is called static scheduling.
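The static-scheduling pseudocode maps fairly directly onto threads. The following is a sketch only: it uses Python's threading module as a stand-in for start_task and wait_for_all_tasks_to_complete, pads the lists so that the 1-based indices of the pseudocode can be kept, and picks arbitrary values for d and b. In CPython the global interpreter lock means this shows the structure rather than a real speedup.

```python
import math
import threading

N = 30
d = 0.5                                   # arbitrary input value
b = [float(i) for i in range(N + 1)]      # index 0 unused (1-based indexing)
a = [0.0] * (N + 1)
c = math.sin(d)


def sub(a, b, c, k, l):
    """Process the fixed segment k..l of the iteration space."""
    for i in range(k, l + 1):
        a[i] = b[i] + c


# start_task: two extra threads each take a fixed third of the iterations.
tasks = [threading.Thread(target=sub, args=(a, b, c, lo, hi))
         for lo, hi in ((1, 10), (11, 20))]
for t in tasks:
    t.start()
sub(a, b, c, 21, 30)                      # the main thread takes the last third
for t in tasks:                           # wait_for_all_tasks_to_complete
    t.join()

e = a[20] + a[15]
```

The segments are disjoint, so the threads can write to the shared list a without any locking.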
Implementation of parallel loops using dynamic scheduling
The example parallel loop can also be implemented as follows:
   c = sin(d)
   start_task sub(a,b,c)
   call sub(a,b,c)
   wait_for_all_tasks_to_complete
   e = a(20) + a(15)
   ...
   subroutine sub(a,b,c)
   logical empty
   ...
   call get_another_iteration(empty,i)
   while .not. empty do
      a(i) = b(i) + c
      call get_another_iteration(empty,i)
   end while
   end sub
Here, the get_another_iteration() subroutine accesses a pool containing all n iteration numbers, gets one of them, and removes it from the pool. When all iterations have been assigned, and therefore the pool is empty, the subroutine returns .true. in variable empty. The next slide shows a third approach, in which get_another_iteration() returns a range of iterations instead of a single iteration.
Another alternative
get_another_iteration() returns a range of iterations instead of a single iteration:
   c = sin(d)
   start_task sub(a,b,c)
   call sub(a,b,c)
   wait_for_all_tasks_to_complete
   e = a(20) + a(15)

   subroutine sub(a,b,c)
   logical empty
   ...
   call get_another_iteration(empty,i,j)
   while .not. empty do
      for k=i to j
         a(k) = b(k) + c
      end for
      call get_another_iteration(empty,i,j)
   end while
   end sub
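Both dynamic variants can be sketched with one shared pool guarded by a lock. The slides do not say how get_another_iteration() is implemented; the class IterationPool below is a hypothetical implementation in which a chunk size of 1 corresponds to handing out single iterations and a larger chunk size to handing out ranges.

```python
import math
import threading


class IterationPool:
    """Hands out chunks of the iteration space 1..n; safe to call from threads."""

    def __init__(self, n, chunk_size=1):
        self._next = 1
        self._n = n
        self._chunk = chunk_size
        self._lock = threading.Lock()

    def get_another_iteration(self):
        """Return (empty, i, j); i..j is the claimed range when empty is False."""
        with self._lock:
            if self._next > self._n:
                return True, 0, 0
            i = self._next
            j = min(i + self._chunk - 1, self._n)
            self._next = j + 1
            return False, i, j


def sub(a, b, c, pool):
    """Keep claiming ranges from the pool until it is empty."""
    empty, i, j = pool.get_another_iteration()
    while not empty:
        for k in range(i, j + 1):
            a[k] = b[k] + c
        empty, i, j = pool.get_another_iteration()


def run(n=30, workers=2, chunk_size=4):
    b = [float(i) for i in range(n + 1)]   # index 0 unused (1-based indexing)
    a = [0.0] * (n + 1)
    c = math.sin(0.5)                      # arbitrary input value
    pool = IterationPool(n, chunk_size)
    tasks = [threading.Thread(target=sub, args=(a, b, c, pool))
             for _ in range(workers - 1)]
    for t in tasks:
        t.start()
    sub(a, b, c, pool)                     # the calling thread also works
    for t in tasks:
        t.join()
    return a, b, c
```

A larger chunk size means fewer trips through the lock (less overhead) at the price of coarser load balancing, which is the granularity tradeoff described earlier.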
Array Programming Languages
Array operations are written in a compact form that makes programs more readable. Consider the loop:
   s=0
   do i=1,n
      a(i)=b(i)+c(i)
      s=s+a(i)
   end do
It can be written (in Fortran 90 notation) as follows:
   a(1:n) = b(1:n) + c(1:n)   ! vector operation
   s = sum(a(1:n))            ! reduction function
A popular array language today is MATLAB.
Parallelizing Vector Expressions
All the arithmetic operations (+, -, *, /, **) involved in a vector expression can be performed in parallel. Intrinsic reduction functions can also be performed in parallel. Vector operations can easily be executed in parallel using almost any form of parallelism, including pipelining and multiprocessing.
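A reduction such as sum can be parallelized by computing partial sums over disjoint pieces and then combining them. The sketch below shows one possible scheme; the function name parallel_sum and the worker count are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(values, workers=4):
    """Sum 'values' by adding up per-piece partial sums computed by a pool."""
    chunk = max(1, -(-len(values) // workers))          # ceiling division
    pieces = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, pieces))           # one partial per piece
    return sum(partials)                                 # final combine step


print(parallel_sum(list(range(1, 101))))
```

With floating-point values, regrouping the additions in this way can change the rounded result slightly compared with a strictly sequential sum.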
Array Programming Languages (cont.)
Array languages can be used to express parallelism because array operations can be easily executed in parallel. Vector programs are easily translated for execution on shared-memory parallel machines. For example,
   c = sin(d)
   a(1:30) = b(2:31) + c
   e = a(20) + a(15)
is translated to
   c = sin(d)
   parallel do i=1 to 30
      a(i) = b(i+1) + c
   end parallel do
   e = a(20) + a(15)
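The equivalence between the vector statement and the loop, including the shift by one in the index of b, can be checked with plain Python lists standing in for an array language. The values of d and b are arbitrary, and index 0 is left unused so that the 1-based indices of the slide can be kept.

```python
import math

d = 0.5                                  # arbitrary input value
b = [float(i) for i in range(32)]        # b[1..31] plays the role of b(1:31)
c = math.sin(d)

# Vector form: a(1:30) = b(2:31) + c
a_vec = [0.0] + [x + c for x in b[2:32]]

# Loop form: each iteration is independent, so it could be a parallel do.
a_loop = [0.0] * 31
for i in range(1, 31):
    a_loop[i] = b[i + 1] + c

e = a_vec[20] + a_vec[15]
```

Because no iteration reads an element that another iteration writes, the loop form carries no dependences and the translation to a parallel loop is safe.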
Typical Parallel Program Bottlenecks: Task Queue
One central task queue may be a bottleneck:
– Use q distributed task queues with p threads, q <= p
– This distributes contention across the queues
– Cost: if the first queue a processor checks is empty, the processor has to go and look at the others
Task insertion is an issue:
– If the number of tasks generated by each processor is uniform, each processor is assigned a specific task queue for insertion
– If task generation is non-uniform (e.g. one processor generates all tasks), tasks should be spread uniformly at random among the queues
Priority queues: give more important jobs higher priority so that they are executed sooner
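The distributed-queue scheme can be sketched as a set of locked queues with two insertion policies and a lookup that falls back to the other queues. The class and method names below are hypothetical, and the search order for the fallback (round-robin from the home queue) is an assumption the slides do not specify.

```python
import random
import threading
from collections import deque


class DistributedTaskQueues:
    """q independent queues, each with its own lock, to spread contention."""

    def __init__(self, q, seed=None):
        self._queues = [deque() for _ in range(q)]
        self._locks = [threading.Lock() for _ in range(q)]
        self._rng = random.Random(seed)

    def put_local(self, home, task):
        """Uniform task generation: each producer inserts into its own queue."""
        with self._locks[home]:
            self._queues[home].append(task)

    def put_random(self, task):
        """Non-uniform generation: spread tasks uniformly at random."""
        self.put_local(self._rng.randrange(len(self._queues)), task)

    def get(self, home):
        """Check the home queue first, then the others; None if all are empty."""
        q = len(self._queues)
        for offset in range(q):
            idx = (home + offset) % q
            with self._locks[idx]:
                if self._queues[idx]:
                    return self._queues[idx].popleft()
        return None


queues = DistributedTaskQueues(4, seed=1)
for task in range(20):          # one producer generates every task
    queues.put_random(task)
```

The loop inside get() is the cost mentioned above: a worker whose home queue is empty pays for probing the remaining queues before it finds work or gives up.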