

Presentation transcript: Selections from CSE332 Data Abstractions course at University of Washington

1 Selections from CSE332 Data Abstractions course at University of Washington
1. Introduction to Multithreading & Fork-Join Parallelism – hashtable, vector sum
2. Analysis of Fork-Join Parallel Programs – maps, vector addition, linked lists versus trees for parallelism, work and span
3. Parallel Prefix, Pack, and Sorting
4. Shared-Memory Concurrency & Mutual Exclusion – concurrent bank account, OpenMP nested locks, OpenMP critical section
5. Programming with Locks and Critical Sections – simple-minded concurrent stack, hashtable revisited
6. Data Races and Memory Reordering, Deadlock, Reader/Writer Locks, Condition Variables

2 A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency
Lecture 1: Introduction to Multithreading & Fork-Join Parallelism
Dan Grossman – Last Updated: August 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/

3 What to do with multiple processors?
The next computer you buy will likely have 4 processors
– Wait a few years and it will be 8, 16, 32, …
– The chip companies have decided to do this (not a "law")
What can you do with them?
– Run multiple totally different programs at the same time
  Already do that? Yes, but with time-slicing
– Do multiple things at once in one program
  Our focus – more difficult
  Requires rethinking everything from asymptotic complexity to how to implement data-structure operations

4 Parallelism vs. Concurrency
Note: terms are not yet standard, but the perspective is essential
– Many programmers confuse these concepts
Parallelism: Use extra resources to solve a problem faster
Concurrency: Correctly and efficiently manage access to shared resources
There is some connection:
– Common to use threads for both
– If parallel computations need access to shared resources, then the concurrency needs to be managed
First 3ish lectures on parallelism, then 3ish lectures on concurrency
[Figure: parallelism fans one piece of work out across many resources; concurrency funnels many requests onto one shared resource]

5 An analogy
CS1 idea: A program is like a recipe for a cook
– One cook who does one thing at a time! (Sequential)
Parallelism:
– Have lots of potatoes to slice?
– Hire helpers, hand out potatoes and knives
– But too many chefs and you spend all your time coordinating
Concurrency:
– Lots of cooks making different things, but only 4 stove burners
– Want to allow access to all 4 burners, but not cause spills or incorrect burner settings

6 Parallelism Example
Parallelism: Use extra computational resources to solve a problem faster (increasing throughput via simultaneous execution)
Pseudocode for array sum
– Bad style for reasons we'll see, but may get roughly 4x speedup

    int sum(int arr[], int len) {
      int res[4];
      FORALL(i=0; i < 4; i++) {   // parallel iterations
        res[i] = sumRange(arr, i*len/4, (i+1)*len/4);
      }
      return res[0]+res[1]+res[2]+res[3];
    }

    int sumRange(int arr[], int lo, int hi) {
      int result = 0;
      for(int j=lo; j < hi; j++)
        result += arr[j];
      return result;
    }

7 Concurrency Example
Concurrency: Correctly and efficiently manage access to shared resources (from multiple possibly-simultaneous clients)
Pseudocode for a shared chaining hashtable
– Prevent bad interleavings (correctness)
– But allow some concurrent access (performance)

    class Hashtable {
      …
      void insert(K key, V value) {
        int bucket = …;
        prevent-other-inserts/lookups in table[bucket]
        do the insertion
        re-enable access to table[bucket]
      }
      V lookup(K key) {
        (like insert, but can allow concurrent lookups to same bucket)
      }
    }

8 Shared memory
The model we will assume is shared memory with explicit threads
Old story: A running program has
– One call stack (with each stack frame holding local variables)
– One program counter (current statement executing)
– Static fields
– Objects (created by new) in the heap (nothing to do with the heap data structure)
New story:
– A set of threads, each with its own call stack & program counter
  No access to another thread's local variables
– Threads can (implicitly) share static fields / objects
  To communicate, write somewhere another thread reads

9 Shared memory
[Figure: several per-thread call stacks (unshared: locals and control, each with its own pc) pointing into a shared region of objects and static fields]
Threads each have their own unshared call stack and current statement (pc for "program counter")
– Local variables are numbers, null, or heap references
Any objects can be shared, but most are not

10 First attempt, Sum class – serial

    class Sum {
      int ans;
      int len;
      int *arr;
    public:
      Sum(int a[], int num) {      // constructor
        arr = a; ans = 0; len = num;
        for (int i=0; i < num; ++i) {
          arr[i] = i+1;            // initialize array
        }
      }
      int sum(int lo, int hi) {
        int ans = 0;
        for (int i=lo; i < hi; ++i) {
          ans += arr[i];
        }
        return ans;
      }
    };

11 Parallelism idea
Example: Sum elements of a large array
Idea: Have 4 threads simultaneously sum 1/4 of the array
– Warning: Inferior first approach – explicitly using a fixed number of threads
[Figure: ans0 + ans1 + ans2 + ans3 → ans]
– Create 4 threads using OpenMP parallel
– Determine thread ID, assign work based on thread ID
– Accumulate partial sums for each thread
– Add together their 4 answers for the final result

12 OpenMP basics
First learn some basics of OpenMP:
1. Pragma-based approach to parallelism
2. Create a pool of threads using #pragma omp parallel
3. Use work-sharing directives to allow each thread to do work
4. To get started, we will explore these work-sharing directives:
   – parallel for
   – reduction
   – single
   – task
   – taskwait
Most of these constructs have an analog in Intel® Cilk Plus™ as well
(A minimal example of creating the thread pool is sketched below.)
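The following is a minimal sketch, added here rather than taken from the course materials, of what #pragma omp parallel does: every thread in the pool runs the block once. The thread count of 4 is an illustrative assumption.

    #include <omp.h>
    #include <stdio.h>

    int main() {
      // Create a pool of threads; each thread executes this block once
      #pragma omp parallel num_threads(4)
      {
        int id = omp_get_thread_num();    // this thread's ID (0..3)
        int n  = omp_get_num_threads();   // number of threads in the pool
        printf("hello from thread %d of %d\n", id, n);
      }
      return 0;
    }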

13 Demo
Walk-through of the parallel sum code – SPMD version
Discussion points:
– #pragma omp parallel
– num_threads clause
(A sketch of the SPMD-style sum appears below.)
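The demo source itself is not part of this transcript; the following is a hedged reconstruction of an SPMD-style sum in which each thread uses its thread ID to pick a chunk and writes its partial result into its own slot.

    #include <omp.h>

    int sum_spmd(int arr[], int len) {
      int partial[4] = {0, 0, 0, 0};
      #pragma omp parallel num_threads(4)
      {
        int id = omp_get_thread_num();
        int lo = id * len / 4;
        int hi = (id + 1) * len / 4;
        int local = 0;
        for (int i = lo; i < hi; ++i)
          local += arr[i];
        partial[id] = local;   // one writer per slot, so no data race
      }
      return partial[0] + partial[1] + partial[2] + partial[3];
    }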

14 Demo
Walk-through of the OpenMP parallel for with reduction code
Discussion points:
– #pragma omp parallel for
– reduction(+:ans) clause
(A sketch of the reduction version appears below.)
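Again as a hedged sketch rather than the actual demo code: with a reduction clause, OpenMP gives each thread a private copy of ans and combines the copies with + when the loop finishes.

    int sum_reduction(int arr[], int len) {
      int ans = 0;
      // Each thread accumulates into a private ans; OpenMP adds the copies at the end
      #pragma omp parallel for reduction(+:ans)
      for (int i = 0; i < len; ++i)
        ans += arr[i];
      return ans;
    }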

15 Shared memory?
Fork-join programs (thankfully) don't require much focus on sharing memory among threads
But in languages like C++, there is memory being shared. In our example:
– lo, hi, and arr are written by the "main" thread and read by the helper thread
– the ans field is written by a helper thread and read by the "main" thread
When using shared memory, you must avoid race conditions
– With concurrency, we'll learn other ways to synchronize later, such as atomics, locks, and critical sections

16 Divide and Conquer Approach
Want to use (only) processors "available to you now"
– Not used by other programs or threads in your program
  Maybe the caller is also using parallelism
  Available cores can change even while your threads run
– If you have 3 processors available and using 3 threads would take time X, then creating 4 threads would take time 1.5X

    // numThreads == numProcessors is bad
    // if some are needed for other things
    int sum(int arr[], int numThreads){
      …
    }

17 Divide and Conquer Approach
Though unlikely for sum, in general sub-problems may take significantly different amounts of time
– Example: Apply method f to every array element, but maybe f is much slower for some data items
  Example: Is a large integer prime?
– If we create 4 threads and all the slow data is processed by 1 of them, we won't get nearly a 4x speedup
  This is an example of a load imbalance

18 Divide and Conquer Approach
The counterintuitive (?) solution to all these problems is to use lots of threads, far more than the number of processors
– But this will require changing our algorithm
[Figure: ans0 + ans1 + … + ansN → ans]
1. Forward-portable: Lots of helpers each doing a small piece
2. Processors available: Hand out "work chunks" as you go
   If 3 processors are available and you have 100 threads, then ignoring constant-factor overheads, the extra time is < 3%
3. Load imbalance: No problem if a slow thread is scheduled early enough
   Variation is probably small anyway if the pieces of work are small

19 Divide-and-Conquer idea
This is straightforward to implement using divide-and-conquer
– Parallelism for the recursive calls
[Figure: a binary tree of + operations, combining halves of the array up to the final sum]

20 Demo
Walk-through of the divide-and-conquer code
Discussion points:
– #pragma omp task
– shared clause
– firstprivate clause
– #pragma omp taskwait
(A sketch of a task-based divide-and-conquer sum appears below.)
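The transcript does not include the demo source; the following is a hedged sketch of how a divide-and-conquer sum can be written with OpenMP tasks. The SEQUENTIAL_CUTOFF value is an illustrative assumption.

    #include <omp.h>

    const int SEQUENTIAL_CUTOFF = 1000;   // assumed cutoff, tune as needed

    int sum_dc(int arr[], int lo, int hi) {
      if (hi - lo < SEQUENTIAL_CUTOFF) {  // small piece: just loop
        int ans = 0;
        for (int i = lo; i < hi; ++i) ans += arr[i];
        return ans;
      }
      int left, right;
      int mid = lo + (hi - lo) / 2;
      #pragma omp task shared(left) firstprivate(arr, lo, mid)
      left = sum_dc(arr, lo, mid);        // left half runs as a task
      right = sum_dc(arr, mid, hi);       // right half runs in this thread
      #pragma omp taskwait                // wait for the left-half task
      return left + right;
    }

    int sum(int arr[], int len) {
      int ans = 0;
      #pragma omp parallel
      #pragma omp single                  // one thread starts the recursion
      ans = sum_dc(arr, 0, len);
      return ans;
    }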

21 Demo
Walk-through of the load-balanced OpenMP parallel for
Discussion points:
– the schedule(dynamic) clause
(A sketch appears below.)
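As a hedged sketch, not the actual demo code: schedule(dynamic) hands out loop iterations in chunks as threads become free, so a few expensive elements do not leave one thread holding most of the work. Here f, in, out, and n are hypothetical stand-ins.

    #include <omp.h>

    int f(int x) { return x * x; }   // stand-in for work whose cost varies per element

    void apply_f(const int *in, int *out, int n) {
      // dynamic scheduling assigns iterations to threads on demand,
      // which balances the load when some calls to f are slow
      #pragma omp parallel for schedule(dynamic)
      for (int i = 0; i < n; ++i)
        out[i] = f(in[i]);
    }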

22 A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency
Lecture 2: Analysis of Fork-Join Parallel Programs
Dan Grossman – Last Updated: August 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/

23 Even easier: Maps (Data Parallelism)
A map operates on each element of a collection independently to create a new collection of the same size
– No combining of results
– For arrays, this is so trivial that some hardware has direct support
Canonical example: vector addition

    int vector_add(int lo, int hi){
      FORALL(i=lo; i < hi; i++) {
        result[i] = arr1[i] + arr2[i];
      }
      return 1;
    }

24 Maps in ForkJoin Framework
Even though there is no result-combining, it still helps with load balancing to create many small tasks
– Maybe not for vector-add, but for more compute-intensive maps
– The forking is O(log n), whereas theoretically other approaches to vector-add are O(1)

    class VecAdd {
      int *arr1, *arr2, *res, len;
    public:
      VecAdd(int _arr1[], int _arr2[], int _res[], int _num) {
        arr1 = _arr1; arr2 = _arr2; res = _res; len = _num;
      }
      int vector_add(int lo, int hi) {
        #pragma omp parallel for
        for (int i=lo; i < hi; ++i) {
          res[i] = arr1[i] + arr2[i];
        }
        return 1;
      }
    };   // end class definition

25 Maps and reductions
Maps and reductions: the "workhorses" of parallel programming
– By far the two most important and common patterns
– Two more-advanced patterns in the next lecture
Learn to recognize when an algorithm can be written in terms of maps and reductions
– Use maps and reductions to describe (parallel) algorithms
– Programming them becomes "trivial" with a little practice
  Exactly like sequential for-loops seem second-nature

26 Trees
Maps and reductions work just fine on balanced trees
– Divide-and-conquer on each child rather than on array subranges
– Correct for unbalanced trees, but won't get much speed-up
Example: minimum element in an unsorted but balanced binary tree in O(log n) time, given enough processors
How to do the sequential cut-off?
– Store the number of descendants at each node (easy to maintain)
– Or approximate it with, e.g., the AVL-tree height
(A sketch of the tree-minimum reduction appears below.)
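Here is a hedged sketch, not from the slides, of the tree-minimum reduction using OpenMP tasks; the Node layout and the CUTOFF value are illustrative assumptions.

    #include <algorithm>
    #include <climits>

    struct Node {
      int value;
      int descendants;          // count of nodes below, used for the cutoff
      Node *left, *right;
    };

    const int CUTOFF = 1000;    // assumed sequential cutoff

    int tree_min(Node *n) {
      if (n == nullptr) return INT_MAX;
      if (n->descendants < CUTOFF)          // small subtree: stay sequential
        return std::min({n->value, tree_min(n->left), tree_min(n->right)});
      int lmin, rmin;
      #pragma omp task shared(lmin)
      lmin = tree_min(n->left);             // left subtree as a task
      rmin = tree_min(n->right);            // right subtree in this thread
      #pragma omp taskwait
      return std::min({n->value, lmin, rmin});
    }

    int parallel_min(Node *root) {
      int result = INT_MAX;
      #pragma omp parallel
      #pragma omp single                    // one thread starts the recursion
      result = tree_min(root);
      return result;
    }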

27 Linked lists
Can you parallelize maps or reduces over linked lists?
– Example: Increment all elements of a linked list
– Example: Sum all elements of a linked list
[Figure: a singly linked list b → c → d → e → f with front and back pointers]
Once again, data structures matter!
For parallelism, balanced trees are generally better than lists, so that we can get to all the data exponentially faster: O(log n) vs. O(n)
– Trees have the same flexibility as lists compared to arrays

28 Work and Span
Let TP be the running time if there are P processors available
Two key measures of run-time:
Work: How long it would take 1 processor = T1
– Just "sequentialize" the recursive forking
Span: How long it would take infinitely many processors = T∞
– The longest dependence chain
– Example: O(log n) for summing an array, since more than n/2 processors is no additional help
– Also called "critical path length" or "computational depth"
(A short worked example follows below.)
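As a quick worked example, added here for illustration rather than taken from the original slides, consider the divide-and-conquer array sum:

    Work:  T1 = O(n)        (every element is added exactly once)
    Span:  T∞ = O(log n)    (the depth of the combining tree)

So the available parallelism is T1 / T∞ = O(n / log n), and with P processors a greedy scheduler achieves roughly TP ≈ T1/P + T∞.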

29 A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency
Lecture 4: Shared-Memory Concurrency & Mutual Exclusion
Dan Grossman – Last Updated: May 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/

30 Canonical Bank Account Example
Correct code in a single-threaded world

    class BankAccount {
      private int balance = 0;
      int getBalance() { return balance; }
      void setBalance(int x) { balance = x; }
      void withdraw(int amount) {
        int b = getBalance();
        if(amount > b)
          throw WithdrawTooLargeException();
        setBalance(b - amount);
      }
      … // other operations like deposit, etc.
    }

31 A bad interleaving
Interleaved withdraw(100) calls on the same account
– Assume initial balance == 150
Time runs downward; Thread 2 runs entirely between Thread 1's read of the balance and its write:

    Thread 1                           Thread 2
    --------                           --------
    int b = getBalance();
                                       int b = getBalance();
                                       if(amount > b) throw …;
                                       setBalance(b - amount);
    if(amount > b) throw …;
    setBalance(b - amount);

"Lost withdraw" – unhappy bank

32 What we need – an abstract data type for mutual exclusion
There are many ways out of this conundrum, but we need help from the language
One basic solution: locks, or critical sections
– OpenMP implements locks as well as critical sections. They do similar things, but there are subtle differences between them

    #pragma omp critical
    {
      // allows only one thread access at a time in the code block
      // code block must have one entrance and one exit
    }

    omp_lock_t myLock;

– OpenMP locks have to be initialized before use (omp_init_lock)
– A lock is acquired with omp_set_lock and released with omp_unset_lock (the nestable variant, omp_nest_lock_t, uses omp_set_nest_lock / omp_unset_nest_lock)
– Only one thread may hold the lock at a time
– Locks allow for exception handling and non-structured jumps
(A small usage sketch follows below.)
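A minimal usage sketch of the plain OpenMP lock API, added here rather than taken from the slides; the shared counter is an illustrative stand-in for the bank balance.

    #include <omp.h>

    omp_lock_t myLock;
    int counter = 0;

    void setup()    { omp_init_lock(&myLock); }     // once, before any use
    void teardown() { omp_destroy_lock(&myLock); }  // once, when done

    void increment() {
      omp_set_lock(&myLock);    // blocks until the lock is available
      ++counter;                // only one thread at a time in here
      omp_unset_lock(&myLock);
    }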

33 Almost-correct pseudocode

    class BankAccount {
      private int balance = 0;
      private Lock lk = new Lock();
      …
      void withdraw(int amount) {
        lk.acquire();   /* may block */
        int b = getBalance();
        if(amount > b)
          throw WithdrawTooLargeException();
        setBalance(b - amount);
        lk.release();
      }
      // deposit would also acquire/release lk
    }

34 Some mistakes
A lock & critical sections are very primitive mechanisms
– Still up to you to use them correctly
Incorrect: Use different locks, or differently named critical sections, for withdraw and deposit
– Mutual exclusion works only when using the same lock
Poor performance: Use the same lock for every bank account
– No simultaneous operations on different accounts
Incorrect: Forget to release a lock (blocks other threads forever!)
– The previous slide is wrong because of the exception possibility!

    if(amount > b) {
      lk.release();   // hard to remember!
      throw WithdrawTooLargeException();
    }

35 Other operations
If withdraw and deposit use the same lock, then simultaneous calls to these methods are properly synchronized
But what about getBalance and setBalance?
– Assume they're public, which may be reasonable
If they don't acquire the same lock, then a race between setBalance and withdraw could produce a wrong result
If they do acquire the same lock, then withdraw would block forever because it tries to acquire a lock it already has

36 Re-acquiring locks?
Can't let the outside world call setBalance1 – it is not protected by the lock
Can't have withdraw call setBalance2, because the locks would be nested
We could use an intricate re-entrant locking scheme, or better yet re-structure the code. Nested locking is not recommended

    void setBalance1(int x) {
      balance = x;
    }
    void setBalance2(int x) {
      lk.acquire();
      balance = x;
      lk.release();
    }
    void withdraw(int amount) {
      lk.acquire();
      …
      setBalanceX(b - amount);
      lk.release();
    }

37 This code is easier to lock
You may provide a setBalance() method for external use, but do NOT call it from the withdraw() member function. Instead, protect the direct modification of the data as shown here – avoiding nested locks

    void setBalance(int x) {
      lk.acquire();
      balance = x;
      lk.release();
    }
    void withdraw(int amount) {
      lk.acquire();
      …
      balance = (b - amount);
      lk.release();
    }

38 Demo
Walk-through of the BankAccount code protected with a critical section
Discussion points:
– #pragma omp critical
(A sketch appears below.)
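The demo code is not in this transcript; the following is a hedged sketch of a withdraw protected by a named critical section. Because control cannot jump out of a critical block, the overdraft case is signaled with a flag rather than a throw.

    class BankAccount {
      int balance = 0;
    public:
      bool withdraw(int amount) {
        bool ok = false;
        #pragma omp critical (account)   // one thread at a time per name
        {
          int b = balance;
          if (amount <= b) {
            balance = b - amount;
            ok = true;
          }
        }
        return ok;                       // false means "withdraw too large"
      }
    };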

39 OpenMP Critical Section method
Works great & is simple to implement
Can't be used across nested function calls where each function enters a critical section. For example, foo() contains a critical section and calls bar(), while bar() also contains a critical section. This will not be allowed by the OpenMP runtime!
– Generally better to avoid nested calls like the foo/bar example above, and as also shown in the Canonical Bank Account Example where withdraw() calls setBalance()
Can't be used with try/catch exception handling (due to multiple exits from the code block)
If try/catch is required, consider using scoped locking – example to follow.

40 Scoped Locking method
Works great & is fairly simple to implement
– More complicated than the critical section method
Can be used across nested function calls where each function acquires the lock
– Generally it is still better to avoid nested calls such as the foo/bar example on the previous slide, and as also shown in the Canonical Bank Account Example where withdraw() calls setBalance()
Can be used with try/catch exception handling
The lock is released when the guard object goes out of scope
(A sketch of a scoped lock built on an OpenMP nested lock appears below.)
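As an added, hedged sketch rather than the course's actual demo code: a small RAII guard over an OpenMP nested lock. The ScopedLock class name and the WithdrawTooLargeException type are illustrative assumptions.

    #include <omp.h>

    struct WithdrawTooLargeException {};   // assumed exception type

    class ScopedLock {
      omp_nest_lock_t *lk_;
    public:
      explicit ScopedLock(omp_nest_lock_t *lk) : lk_(lk) { omp_set_nest_lock(lk_); }
      ~ScopedLock() { omp_unset_nest_lock(lk_); }        // runs even on throw
    };

    class BankAccount {
      int balance = 0;
      omp_nest_lock_t lk;
    public:
      BankAccount()  { omp_init_nest_lock(&lk); }
      ~BankAccount() { omp_destroy_nest_lock(&lk); }
      void setBalance(int x) {
        ScopedLock guard(&lk);       // released at end of scope
        balance = x;
      }
      void withdraw(int amount) {
        ScopedLock guard(&lk);       // nested lock: safe to re-acquire below
        int b = balance;
        if (amount > b) throw WithdrawTooLargeException();
        setBalance(b - amount);      // re-acquires the same nest lock
      }
    };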

41 Demo
Walk-through of the BankAccount code using scoped locking
Discussion points:
– scoped locking using OpenMP nested locks

42 A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency
Lecture 5: Programming with Locks and Critical Sections
Dan Grossman – Last Updated: May 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/

43 Demo
Walk-through of the concurrent stack code
Discussion points:
– using OpenMP nested locks
– using an OpenMP critical section

44 Example, using critical sections

    class Stack {
      private E[] array = (E[])new Object[SIZE];
      int index = -1;
      bool isEmpty() {   // unsynchronized: wrong?!
        return index==-1;
      }
      void push(E val) {
        #pragma omp critical
        { array[++index] = val; }
      }
      E pop() {
        E temp;
        #pragma omp critical
        { temp = array[index--]; }
        return temp;
      }
      E peek() {         // unsynchronized: wrong!
        return array[index];
      }
    }

45 Why wrong?
It looks like isEmpty and peek can "get away with this" since push and pop adjust the state "in one tiny step"
But this code is still wrong and depends on language-implementation details you cannot assume
– Even "tiny steps" may require multiple steps in the implementation: array[++index] = val probably takes at least two steps
– The code has a data race, allowing very strange behavior
  Important discussion in the next lecture
Moral: Don't introduce a data race, even if every interleaving you can think of is correct

46 3 choices
For every memory location (e.g., object field) in your program, you must obey at least one of the following:
1. Thread-local: Don't use the location in more than 1 thread
2. Immutable: Don't write to the memory location
3. Synchronized: Use synchronization to control access to the location
[Figure: all memory divided into thread-local memory, immutable memory, and the remainder, which needs synchronization]

47 Example
Suppose we want to change the value for a key in a hashtable without removing it from the table
– Assume a critical section guards the whole table

    #pragma omp critical
    {
      v1 = table.lookup(k);
      v2 = expensive(v1);
      table.remove(k);
      table.insert(k,v2);
    }

Papa Bear's critical section was too long (table locked during the expensive call)

48 Example
Suppose we want to change the value for a key in a hashtable without removing it from the table
– Assume a critical section guards the whole table

    #pragma omp critical
    {
      v1 = table.lookup(k);
    }
    v2 = expensive(v1);
    #pragma omp critical
    {
      table.remove(k);
      table.insert(k,v2);
    }

Mama Bear's critical section was too short (if another thread updated the entry, we will lose an update)

49 Example
Suppose we want to change the value for a key in a hashtable without removing it from the table
– Assume a critical section guards the whole table

    done = false;
    while(!done) {
      #pragma omp critical
      {
        v1 = table.lookup(k);
      }
      v2 = expensive(v1);
      #pragma omp critical
      {
        if(table.lookup(k)==v1) {
          done = true;
          table.remove(k);
          table.insert(k,v2);
        }
      }
    }

Baby Bear's critical section was just right (if another update occurred, try our update again)

50 Don’t roll your own It is rare that you should write your own data structure –Provided in standard libraries –Point of these lectures is to understand the key trade-offs and abstractions Especially true for concurrent data structures –Far too difficult to provide fine-grained synchronization without race conditions –Standard thread-safe libraries like TBB ConcurrentHashMap written by world experts Guideline #5: Use built-in libraries whenever they meet your needs 50Sophomoric Parallelism & Concurrency, Lecture 5

