Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Principles of Parallel Programming, First Edition
by Calvin Lin and Lawrence Snyder
Chapter 4: First Steps Toward Parallel Programming

Toward writing parallel programs
Build intuition toward parallelism
When to parallelize
When overhead is too great
Consider
  – Data allocation
  – Work allocation
  – Data structure design
  – Algorithms

Three ways to formulate parallel computations
Unlimited parallelism
Fixed parallelism
Scalable parallelism

Two classes of parallel algorithms
Data parallel
Task parallel

Data parallel
Perform the same computation on different data items at the same time
Parallelism grows as the data grows
Example
  – P chefs preparing N meals
  – Each chef prepares N/P meals
  – As N increases, P can increase too, subject to resource constraints

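The chef example can be sketched in Python as a block decomposition: each of P workers takes a contiguous share of the N items and applies the same computation to it. This is a minimal sketch, not Peril-L; the names `prepare`, `block`, and `cook_all` are hypothetical, and in CPython the threads illustrate the structure rather than guarantee a speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare(meal):
    # Stand-in for the per-item computation (hypothetical).
    return f"cooked {meal}"


def block(items, p, num_workers):
    # Contiguous share of `items` owned by worker p; block sizes differ by at most 1.
    n = len(items)
    lo = p * n // num_workers
    hi = (p + 1) * n // num_workers
    return items[lo:hi]


def cook_all(meals, num_chefs):
    def chef(p):
        # Every chef runs the same code on a different block of the data.
        return [prepare(m) for m in block(meals, p, num_chefs)]

    with ThreadPoolExecutor(max_workers=num_chefs) as pool:
        parts = list(pool.map(chef, range(num_chefs)))
    # Concatenating the blocks in chef order restores the original order.
    return [dish for part in parts for dish in part]
```

Raising `num_chefs` as the list of meals grows is the sense in which the parallelism scales with the data.
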
Task parallel
Perform distinct computations at the same time
The number of tasks is typically fixed
Not scalable
Example
  – One chef for salad, one for dessert, one for the appetizer
  – There are dependencies among tasks
  – Utilizes pipelining
A hybrid of data and task parallelism is often used

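One way to picture task parallelism with dependencies is a pipeline in which each stage is a distinct task running in its own thread and handing results to the next stage. The sketch below is an illustration under that assumption; `stage`, `run_pipeline`, and the `DONE` sentinel are hypothetical names, not part of Peril-L.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def stage(work, inbox, outbox):
    # Apply `work` to each item from inbox and pass the result downstream.
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(work(item))


def run_pipeline(items, stages):
    # One queue in front of each stage, plus one for the final results.
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=stage, args=(work, queues[k], queues[k + 1]))
        for k, work in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

The number of threads equals the number of stages, regardless of how many items flow through, which is why this style does not scale with the data.
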
Pseudocode: Peril-L
Minimal, easy to learn
Universal across languages
Allows reasoning about performance
Extends C

Peril-L
Threads
  – forall (i in (1..12)) printf("Hello %i\n", i);
Prints 12 Hellos in an unpredictable order
Threads compete and execute in parallel

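A rough Python analogue of the `forall` statement starts one thread per index and waits for all of them. This is a sketch under the assumption that `forall` finishes only when every thread has finished; `forall` and `hello_all` here are hypothetical helper names.

```python
import threading


def forall(indices, body):
    # Start one thread per index, then wait for all of them to finish.
    threads = [threading.Thread(target=body, args=(i,)) for i in indices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def hello_all(n):
    lines = []
    # Each thread appends its own greeting; arrival order is not guaranteed.
    forall(range(1, n + 1), lambda i: lines.append(f"Hello {i}"))
    return lines
```

Calling `hello_all(12)` yields the twelve greetings, but the order in which they appear can differ from run to run.
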
Peril-L
exclusive
  – One thread executes the body at a time
    forall (i in (1..12)) {
      exclusive { printf("Hello %i\n", i); }
    }
barrier
  – Forces every thread to stop at the barrier until all threads have arrived, at which point they all continue

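The effect of `exclusive` can be approximated in Python with a lock, as in this minimal sketch. The function name `count_with_exclusive` is hypothetical, and a `threading.Lock` is only an analogue of the Peril-L construct.

```python
import threading


def count_with_exclusive(num_threads, increments):
    total = 0
    lock = threading.Lock()

    def body():
        nonlocal total
        for _ in range(increments):
            with lock:  # plays the role of `exclusive { ... }`
                total += 1

    threads = [threading.Thread(target=body) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

Because only one thread is inside the locked region at a time, no update to `total` is lost.
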
Peril-L
barrier
  – All threads wait for all to arrive, then continue
    forall (i in (1..12)) {
      printf("tweedle dee\n");
      barrier;
      printf("tweedle dum\n");
    }
All tweedle dees print before any tweedle dum

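The tweedle dee / tweedle dum ordering can be reproduced with Python's `threading.Barrier`. This is a sketch of the same idea rather than a translation of Peril-L; `tweedle` is a hypothetical function name, and the output is collected in a list instead of printed so the ordering can be inspected.

```python
import threading


def tweedle(num_threads):
    log = []
    barrier = threading.Barrier(num_threads)

    def body():
        log.append("tweedle dee")
        barrier.wait()  # nobody continues until all threads have arrived
        log.append("tweedle dum")

    threads = [threading.Thread(target=body) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

With 12 threads, the first 12 entries of the returned list are all "tweedle dee" and the last 12 are all "tweedle dum".
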
Peril-L memory model
Global
  – Variables visible to all threads
  – Declared outside a forall
  – Written underlined
Local
  – Variables visible only to the local thread
  – Declared inside a forall
  – Written without underlining

Peril-L
Multiple concurrent reads of a location are allowed
Writes to a location occur one at a time
  – Race conditions are possible; the last write wins

Connecting global and local memory
Global memory is distributed across the threads' local memories
localize maps a portion of a global array into a thread's local memory
  int allData[n];                              // global
  forall (thdID in (0..P-1)) {                 // spawn threads
    int size = n/P;                            // size of allocations
    int locData[size] = localize(allData[]);   // map globals to this thread's locals
    ...
  }

Connecting global and local memory (cont.)
Modifying the localized data is the same as modifying the global data, but without the λ delay of accessing nonlocal memory

Issues in localizing global memory
Localized arrays use local indices, which start at 0
When multiple threads share a processor, each thread still keeps its data local to the thread
There is no local copy; the local and global names reference the same memory location?

Handy functions
size = mySize(global, i)
  – Returns the size of the ith dimension of the local portion of the global array
localToGlobal(locData, i, j)
  – Returns the global index corresponding to the ith index of the jth dimension of the local array locData

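For a one-dimensional array split into contiguous blocks, the index bookkeeping behind localization can be sketched as follows. Everything here is an assumption made for illustration: the block distribution, the signatures of `my_size` and `local_to_global` (which differ from the Peril-L `mySize` and `localToGlobal`), and the use of a `memoryview` to model a localized array that aliases the global storage instead of copying it.

```python
from array import array


def block_bounds(n, num_threads, thread_id):
    # Half-open global index range [lo, hi) owned by thread_id.
    lo = thread_id * n // num_threads
    hi = (thread_id + 1) * n // num_threads
    return lo, hi


def localize(all_data, num_threads, thread_id):
    # A memoryview slice aliases the global storage; nothing is copied.
    lo, hi = block_bounds(len(all_data), num_threads, thread_id)
    return memoryview(all_data)[lo:hi]


def my_size(n, num_threads, thread_id):
    # Number of elements in this thread's local portion.
    lo, hi = block_bounds(n, num_threads, thread_id)
    return hi - lo


def local_to_global(n, num_threads, thread_id, i):
    # Translate a 0-origin local index into the matching global index.
    lo, hi = block_bounds(n, num_threads, thread_id)
    if not 0 <= i < hi - lo:
        raise IndexError("local index out of range")
    return lo + i


def demo():
    all_data = array("i", range(8))
    loc = localize(all_data, 4, 1)  # thread 1 of 4 owns global indices 2..3
    loc[0] = 99                     # write through local index 0
    return all_data[local_to_global(8, 4, 1, 0)]  # read it back via global index 2
```

In `demo`, the write through local index 0 is visible at global index 2, which models the "same as modifying global data" behavior described earlier.
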
Full/empty variables: synchronization
Like matter, next slide
Incurs overhead like a global memory access, λ
  int t' = 0;   // declare t' empty, then fill it

Table 4.1 Semantics of full/empty variables.

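The contents of Table 4.1 are not reproduced in this text, so the sketch below assumes the conventional full/empty semantics: reading an empty variable stalls until it is filled and then empties it, and writing a full variable stalls until it is emptied and then fills it. The class `FullEmpty` and the helper `relay` are hypothetical names.

```python
import threading


class FullEmpty:
    def __init__(self):
        self._cond = threading.Condition()
        self._full = False  # a new variable starts out empty
        self._value = None

    def write(self, value):
        with self._cond:
            while self._full:      # writing a full variable stalls
                self._cond.wait()
            self._value, self._full = value, True
            self._cond.notify_all()

    def read(self):
        with self._cond:
            while not self._full:  # reading an empty variable stalls
                self._cond.wait()
            value, self._value, self._full = self._value, None, False
            self._cond.notify_all()
            return value


def relay(n):
    # A producer thread fills the variable n times; the caller empties it n times.
    var = FullEmpty()
    producer = threading.Thread(target=lambda: [var.write(k) for k in range(n)])
    producer.start()
    received = [var.read() for _ in range(n)]
    producer.join()
    return received
```

Each value is handed over exactly once, so `relay(5)` returns the values in the order they were written.
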
Reduce/Scan
Reduce: combines a set of values to produce a single value
  – Written with /
  – +/count      // add the elements of count
Scan: parallel prefix computation; embodies logic that performs a sequential operation in parts and carries along the intermediate results
  – Written with \
  – min\items    // scan: the smallest value in each prefix of items

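The meaning of the two operators can be checked with a sequential Python sketch. It shows only what values reduce and scan produce, not how they are computed in parallel; `plus_reduce` and `min_scan` are hypothetical names.

```python
from functools import reduce
from itertools import accumulate
import operator


def plus_reduce(count):
    # Analogue of +/count: fold all values into one.
    return reduce(operator.add, count, 0)


def min_scan(items):
    # Analogue of min\items: element k is the minimum of items[0..k].
    return list(accumulate(items, min))
```

For example, `plus_reduce([2, 0, 3, 1])` gives 6 and `min_scan([5, 3, 4, 1, 2])` gives `[5, 3, 3, 1, 1]`.
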
Additional example
  least = min/dataArray;   // scalar result stored in the local variable least of each thread
Reduce and scan can combine values across multiple threads

More examples: reduce
count is local to each thread
  total = +/count;
The counts are combined into a single result, which is stored in each thread

More examples: scan
count is local to each thread
  beforeMe = +\count;
The count values are accumulated so that the ith thread's beforeMe variable is assigned the sum of the first i count values

Implied synchronization in reduce and scan
Consider
  largest = max/localTotal;
All threads must arrive at this statement before the reduction can be performed
Threads proceed only after the assignment

Programming consideration
  exclusive { total += priv_count; }   // done serially
versus
  total = +/priv_count;                // done with a tree structure
Converts O(P) to O(log P)

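The O(log P) claim follows from combining values pairwise in rounds: each round halves the number of partial results. The sketch below simulates the rounds sequentially and counts them; `tree_reduce` is a hypothetical name, and within a round the pairwise combinations are independent, so they are the part that could run in parallel.

```python
def tree_reduce(values, op):
    # Pairwise combine in rounds; returns (result, number_of_rounds).
    level = list(values)
    if not level:
        raise ValueError("need at least one value")
    rounds = 0
    while len(level) > 1:
        # Combine neighbors (0,1), (2,3), ...; these pairs do not overlap.
        nxt = [op(level[k], level[k + 1]) for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired value moves up unchanged
        level = nxt
        rounds += 1
    return level[0], rounds
```

With 8 private counts the tree needs 3 rounds, whereas the `exclusive` version performs 8 updates one after another.
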
Figure 4.1 The Count 3s computation (Try 3) written in the Peril-L notation.

Formulating parallelism
Fixed parallelism
  – Write code designed for a particular machine
  – Improving the machine may not increase parallelism
Unlimited parallelism
  – Use forall (i in (0..n-1))
  – Will use available resources
  – Will require substantial thread communication

Figure 4.2 Fixed Parallelism solution to Count 3s (t=4).

Formulating parallelism (cont.)
Scalable parallelism, as follows:
  – Determine how the components (data structures, workload, etc.) grow as n increases
  – Formulate a set S of substantial subproblems, assigning natural units of the solution to each subproblem
  – Solve each subproblem independently
Utilizes locality

Figure 4.3 Scalable Parallelism solution to Count 3s. Notice that the array segment has been localized.

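The figure itself is not reproduced here, so the following is only a sketch of the scalable pattern the caption describes: each thread works on its own contiguous segment, keeps a private count, and the private counts are combined at the end. The function `count3s` and its block distribution are assumptions, not the book's code.

```python
from concurrent.futures import ThreadPoolExecutor


def count3s(data, num_threads):
    n = len(data)

    def local_count(t):
        # Each thread scans only its own contiguous segment of the array.
        lo = t * n // num_threads
        hi = (t + 1) * n // num_threads
        return sum(1 for x in data[lo:hi] if x == 3)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        priv_counts = list(pool.map(local_count, range(num_threads)))
    # Combining the private counts plays the role of +/priv_count.
    return sum(priv_counts)
```

The same code runs unchanged for any positive `num_threads`, which is the property that distinguishes the scalable formulation from the fixed one.
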
Table 4.2 Helper functions.

Figure 4.4 Odd/Even Interchange to alphabetize a list L of records on field x.

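Since the figure is not reproduced, here is a sequential sketch of odd/even interchange (also known as odd-even transposition sort). It alternates between comparing pairs that start at even positions and pairs that start at odd positions. The function name and the fixed count of n phases are choices made for this sketch; the figure's program may detect termination differently.

```python
def odd_even_sort(records, key=lambda r: r):
    a = list(records)
    n = len(a)
    for phase in range(n):  # n phases are enough to sort n items
        start = phase % 2   # even phases use pairs (0,1),(2,3)...; odd use (1,2),(3,4)...
        # The pairs within one phase are disjoint, so they could be compared in parallel.
        for i in range(start, n - 1, 2):
            if key(a[i]) > key(a[i + 1]):
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

Passing `key=lambda r: r["x"]` sorts records on a field named `x`, in the spirit of the caption.
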
Figure 4.5 Fixed 26-way parallel solution to alphabetizing. The function letRank(x) returns the 0-origin rank of the Latin letter x.

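A fixed 26-way decomposition can be sketched by giving each letter its own bucket and sorting the buckets independently. This is an illustration of the idea, not the figure's program; `let_rank` mirrors the described `letRank`, `alphabetize_26way` is a hypothetical name, and the input is assumed to be non-empty lowercase ASCII words.

```python
from concurrent.futures import ThreadPoolExecutor


def let_rank(ch):
    # 0-origin rank of a Latin letter: 'a' -> 0, ..., 'z' -> 25.
    return ord(ch.lower()) - ord("a")


def alphabetize_26way(words):
    buckets = [[] for _ in range(26)]
    for w in words:
        buckets[let_rank(w[0])].append(w)
    # Exactly 26 independent tasks, one per letter, however large the input is.
    with ThreadPoolExecutor(max_workers=26) as pool:
        sorted_buckets = list(pool.map(sorted, buckets))
    return [w for bucket in sorted_buckets for w in bucket]
```

The parallelism is capped at 26 no matter how many processors are available, and a skewed letter distribution leaves some of the 26 tasks with little work.
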
Figure 4.6

Table 4.3 Merge operations.

Figure 4.7 Peril-L program using Batcher's sort to alphabetize records in L.

Figure 4.7 Peril-L program using Batcher's sort to alphabetize records in L. (cont.)

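The Peril-L program is not reproduced in this text. As a rough companion, the sketch below shows a sequential simulation of Batcher's bitonic sort, assuming that is the variant the figure uses and assuming the input length is a power of two. The function name `bitonic_sort` is hypothetical.

```python
def bitonic_sort(values):
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:          # k is the size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:      # j is the distance between compared elements
            # All compare-exchanges for a given (k, j) touch disjoint pairs,
            # so they could be carried out in parallel.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

The pattern of comparisons depends only on the length of the input, not on the values, which is what makes this kind of sort attractive for a parallel formulation.
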