EECS 583 – Class 18
Research Topic 1: Breaking Dependences, Dynamic Parallelization
University of Michigan
November 21, 2011

Announcements & Reading Material
- No class on Wednesday (no paper summary either!)
- We are grading the exams
- Today's class reading
  - "Spice: Speculative Parallel Iteration Chunk Execution," E. Raman, N. Vachharajani, R. Rangan, and D. I. August, Proc. 2008 Intl. Symposium on Code Generation and Optimization, April 2008.
- Next class reading (Monday, Nov 28)
  - "Exploiting coarse-grained task, data, and pipeline parallelism in stream programs," M. Gordon, W. Thies, and S. Amarasinghe, Proc. of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, Oct. 2006.

The Data Dependence Problem

    while (ptr != NULL) {
        ...
        ptr = ptr->next;
        sum = sum + foo;
    }

- How to deal with control dependences?
- How to deal with linked data structures?
- How to remove programmatic dependences?

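We Know How to Break Some of These Dependences – Recall ILP Optimizations
- The accumulation sum += x carries a dependence from one iteration to the next.
- Apply accumulator variable expansion: Thread 0 and Thread 1 each accumulate into a private copy (sum1 += x, sum2 += x).
- Combine the copies after the loop: sum = sum1 + sum2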
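Accumulator variable expansion can be sketched in C as follows. This is a minimal illustration, not code from the lecture: the function names and the two-way split are assumptions, and the two chunks run one after the other here, although each call to sum_chunk could be handed to its own thread because the chunks share no accumulator.

```c
#include <stddef.h>

/* Sum one contiguous chunk into a private accumulator.
   (Hypothetical helper; in a threaded version each call runs on its own thread.) */
static long sum_chunk(const int *x, size_t begin, size_t end) {
    long acc = 0;
    for (size_t i = begin; i < end; ++i)
        acc += x[i];
    return acc;
}

/* Accumulator variable expansion: the single "sum" is replaced by two
   private copies that are combined once after the loop. */
long expanded_sum(const int *x, size_t n) {
    size_t mid = n / 2;
    long sum1 = sum_chunk(x, 0, mid);   /* first half of the iterations  */
    long sum2 = sum_chunk(x, mid, n);   /* second half of the iterations */
    return sum1 + sum2;                 /* sum = sum1 + sum2             */
}
```

The split is safe for integers because addition is associative and commutative. With floating-point values the reordering can change rounding, so the result may differ slightly from the sequential loop.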

DOALL Coverage – Provable and Profiled
- Still not good enough!
- A few dependences hinder parallelization in many loops

Speculative Loop Fission
- If this were traversing an array, it would be a DOALL loop
- Separate out data structure access and work
- Parallelize work

Original loop:

    1: while (node) {
    2:     work(node);
    3:     node = node->next;
       }

Sequential part:

    1: while (node) {
    4:     node_array[count++] = node;
    3:     node = node->next;
       }

Parallel part (with mis-speculation check):

        XBEGIN
    5:  node = node_array[IS];
        i = 0;
    1': while (node && i++ < CS) {
    2:      work(node);
    3':     node = node->next;
        }
        RECV(THREAD j-1)
        XCOMMIT
        if (node != node_array[IS+CS]) {
            update_node_array;
            kill_other_threads();
        }
        SEND(THREAD j+1)

Parallel part (without the check):

        XBEGIN
    5:  node = node_array[IS];
        i = 0;
    1': while (node && i++ < CS) {
    2:      work(node);
    3':     node = node->next;
        }
        RECV(THREAD j-1)
        XCOMMIT
        SEND(THREAD j+1)

Execution of Fissed Loop
(Same code as on the previous slide.)
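The fissed loop can be approximated in plain C as sketched below. This is an illustration under stated assumptions, not the slide's implementation: the Node type, MAX_NODES, double_value, and fissed_loop are hypothetical names, the chunks run one after another instead of on separate threads, and XBEGIN/XCOMMIT/SEND/RECV are omitted. IS and CS from the slide appear as is (chunk start index) and cs (chunk size).

```c
#include <stddef.h>

typedef struct Node { int value; struct Node *next; } Node;

enum { MAX_NODES = 1024 };  /* hypothetical capacity for this sketch */

/* Sequential part: traverse once and record each node's address. */
static size_t collect_nodes(Node *node, Node *node_array[], size_t cap) {
    size_t count = 0;
    while (node && count < cap) {
        node_array[count++] = node;
        node = node->next;
    }
    return count;
}

/* Parallel part for one chunk: start at node_array[is] and run at most cs
   iterations. Returns the node the chunk stopped at, for validation. */
static Node *run_chunk(Node *node_array[], size_t is, size_t cs,
                       void (*work)(Node *)) {
    Node *node = node_array[is];
    size_t i = 0;
    while (node && i++ < cs) {
        work(node);
        node = node->next;
    }
    return node;
}

/* Example work function (hypothetical). */
void double_value(Node *n) { n->value *= 2; }

/* Returns 1 if every chunk ended where the next chunk began, 0 otherwise
   (the case where the slide updates node_array and kills other threads). */
int fissed_loop(Node *head, size_t cs, void (*work)(Node *)) {
    Node *node_array[MAX_NODES + 1];
    if (cs == 0) return 0;
    size_t count = collect_nodes(head, node_array, MAX_NODES);
    node_array[count] = NULL;                    /* sentinel for the last chunk */
    int ok = 1;
    for (size_t is = 0; is < count; is += cs) {  /* each chunk could be a thread */
        Node *end = run_chunk(node_array, is, cs, work);
        size_t next = (is + cs < count) ? is + cs : count;
        if (end != node_array[next]) ok = 0;     /* list changed during work */
    }
    return ok;
}
```

The validation compares the node each chunk stops at with the recorded start of the next chunk, mirroring the `node != node_array[IS+CS]` test on the slide. A list longer than MAX_NODES is reported as a failed validation in this sketch.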

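Infrequent Dependence Isolation
[Figure: control-flow graphs before and after the transformation. Before: blocks A, B, C with edges labeled 99% and 1%. After: blocks A, B, C' and C, with a break on the infrequent path and edges labeled 1% and 99%.]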

Infrequent Dependence Isolation – cont.
Sample loop from the yacc benchmark. The update of best and times is marked 1% on the slide.

Original loop:

    for (j = 0; j <= nstate; ++j) {
        if (tystate[j] == 0) continue;
        if (tystate[j] == best) continue;
        count = 0;
        cbest = tystate[j];
        for (k = j; k <= nstate; ++k)
            if (tystate[k] == cbest) ++count;
        if (count > times) {
            best = cbest;
            times = count;
        }
    }

Transformed loop:

    j = 0;
    while (j <= nstate) {
        for ( ; j <= nstate; ++j) {
            if (tystate[j] == 0) continue;
            if (tystate[j] == best) continue;
            count = 0;
            cbest = tystate[j];
            for (k = j; k <= nstate; ++k)
                if (tystate[k] == cbest) ++count;
            if (count > times) break;
        }
        if (count > times) {
            best = cbest;
            times = count;
            j++;
        }
    }

Speculative Prematerialization
- Iterations are sequential due to the distance-1 recurrence of the red variables
- Decouple iteration chunks by prematerializing last outside the loop

DOALL Coverage – Profiled and Transformed

Coverage Breakdown

Improving DSWP with Value Speculation: Motivating Example

    l = create_list();
    while (...) {
        process_list(l);
        modify_list(l);
    }

    while (c != null) {
        c1 = c;
        c = c->next_cl;
        int w = c1->pick_weight;
        if (w < wm) {
            wm = w;
            cm = c1;
        }
    }

Need to Break Cross-thread Dependences
[Figure: two schedules of iterations 1 to 6 on Thread 1 and Thread 2, one labeled Synchronization and one labeled Value Speculation (value predict c).]

Value Speculation
[Figure: linked list with node addresses 0x100, 0x180, 0x120, 0x200, 0x220, 0x300; c is value-predicted so that Thread 1 and Thread 2 split iterations 1 to 6.]

Value Speculation for TLP: Observation 1
- Predicting a value in a particular iteration is harder than predicting a value of the operation within a range of iterations
[Figure labels: 12300, April 15, April 30]

Value Speculation for TLP: Observation 2
- It is sufficient to predict the values in a few iterations to get TLP
[Figure: linked list with node addresses 0x100, 0x180, 0x120, 0x200, 0x220, 0x300; value predict c = 0x200; iterations 1 to 6 split between Thread 1 and Thread 2.]

Memoize and Speculative Parallel Execution
[Figure: Core 1 and Core 2 traversing list nodes 1 to 6; labels SVA and Memory.]

Speculative Parallel Execution
[Figure: Core 1 and Core 2 traversing list nodes 1 to 7; label Memory.]

Mis-speculation
[Figure: Core 1 and Core 2 traversing list nodes 1 to 6; label Memory.]
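The idea of predicting c for only one iteration, then validating the prediction, can be sketched in C as below. This is a simplified illustration and not the Spice implementation: the Clause type reuses the field names from the motivating example, while min_weight_chunk, speculative_min_weight, and the predicted parameter are hypothetical. The second chunk runs only after the first chunk has confirmed the prediction, so the sketch never dereferences a stale pointer; a real speculative system runs both chunks concurrently and needs rollback support for that case.

```c
#include <stddef.h>
#include <limits.h>

typedef struct Clause { int pick_weight; struct Clause *next_cl; } Clause;

/* Scan from start until stop (exclusive) or the end of the list.
   *reached_stop is set to 1 if the scan ended by meeting stop. */
static int min_weight_chunk(Clause *start, Clause *stop, int *reached_stop) {
    int wm = INT_MAX;
    Clause *c = start;
    while (c != NULL && c != stop) {
        if (c->pick_weight < wm) wm = c->pick_weight;
        c = c->next_cl;
    }
    *reached_stop = (c == stop);
    return wm;
}

/* predicted: a value of c memoized during an earlier traversal (may be stale).
   Returns the minimum weight; *misspeculated is 1 if the prediction failed. */
int speculative_min_weight(Clause *head, Clause *predicted, int *misspeculated) {
    int hit = 0;
    int unused = 0;

    /* "Thread 1": iterate from the head until c equals the predicted value. */
    int w1 = min_weight_chunk(head, predicted, &hit);
    if (predicted == NULL || !hit) {
        /* Thread 1 has covered the whole list, so its result is complete
           and any work started from the predicted node is discarded. */
        *misspeculated = (predicted != NULL);
        return w1;
    }

    /* "Thread 2": iterate from the predicted node to the end of the list. */
    int w2 = min_weight_chunk(predicted, NULL, &unused);
    *misspeculated = 0;
    return (w1 < w2) ? w1 : w2;
}
```

Validation is cheap because Thread 1 only has to check whether it ever observes the predicted address. If it does not, it simply keeps going to the end of the list, which is also the recovery path.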

Memoizing Once: Problems – Delete One of the Starting Nodes
[Figure: list of nodes 1 to 6.]

Memoizing Once: Problems – The List Grows after the Initial Traversal
[Figure: list of nodes 1 to 6, extended with nodes 7 to 12.]

Discussion Point 1 – Fission vs Value Speculation
- When is fission better?
- When is value speculation better?
- Is either more powerful than the other?
- Which would you use in your compiler and why?
- Would you be confident your code would run correctly if these transformations were done?

Discussion Point 2 – Dependences
- Are these "tricks" valid for other common data structures?
  - Are any generalizations needed?
- What other common data dependences are we missing in loops?

Client-side Computation in JavaScript
- Flexibility, ease of prototyping, and portability
- Poor performance is one of the main challenges

Client-side Applications
- Interaction-intensive:
  - Largely composed of event handlers, triggered by the user
  - Examples are Gmail, Facebook, etc.
- Compute-intensive:
  - Dominated by loops and hot functions
  - Online image editing such as Adobe's and Google's Picnik
  - A lot more potential:
    - Online games
    - Video editing
    - Sound editing and voice recognition

JavaScript Parallelization
- A typical static parallelization flow
[Figure labels: Source code; Memory dependence analysis; Parallel code generation; Parallel execution; Compile time; Runtime; Memory profiling; Data flow analysis; Runtime dynamic parallelization; Speculation engine (software transactional memory); three "$" markers.]

ParaScript Approach
- Light-weight dynamic analysis & code generation for speculative DOALL loops
- Low-cost customized SW speculation with a single checkpoint
[Figure: flow through hot loop detection, initial parallelizability assessment, parallel code generation, and parallel execution with customized speculation. Stage labels: Loop Selection, Runtime. Outcomes: Finish, or Abort to sequential execution.]

Dependence Analysis
- JIT compilation time: data flow analysis
- Runtime: initial tests + range-based monitoring
- Runtime: reference-counting-based monitoring

Scalar Array Conflict Detection
- Initial assessment catches trivial conflicts
- Keep track of max and min accessed element indices
- Cross-check RD/WR sets after thread execution

Example:

    Thread 1              Thread 2
    A[0] = …              A[6] = A[5]+1
    A[5] = …
    B[7] = …

    Thread 1 array write-set      Thread 2 array write-set
    ptr   min   max               ptr   min   max
    &A    0     5                 &A    6     6
    &B    7     7
                                  Thread 2 array read-set
                                  ptr   min   max
                                  &A    5     5
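The cross-check of read and write sets can be sketched in C using the (ptr, min, max) entries from the slide. The type and function names (Range, AccessSets, threads_conflict) are assumptions made for this illustration and are not taken from ParaScript.

```c
#include <stddef.h>

/* One entry of a read-set or write-set: array base plus accessed index range. */
typedef struct { const void *ptr; size_t min, max; } Range;

/* The per-thread sets gathered during speculative execution. */
typedef struct {
    const Range *reads;  size_t n_reads;
    const Range *writes; size_t n_writes;
} AccessSets;

/* Two entries overlap if they name the same array and their ranges intersect. */
static int overlaps(Range a, Range b) {
    return a.ptr == b.ptr && a.min <= b.max && b.min <= a.max;
}

static int any_overlap(const Range *xs, size_t nx, const Range *ys, size_t ny) {
    for (size_t i = 0; i < nx; ++i)
        for (size_t j = 0; j < ny; ++j)
            if (overlaps(xs[i], ys[j])) return 1;
    return 0;
}

/* Cross-check after both threads finish: any write/write, write/read, or
   read/write overlap is treated as a conflict. */
int threads_conflict(AccessSets t1, AccessSets t2) {
    return any_overlap(t1.writes, t1.n_writes, t2.writes, t2.n_writes)
        || any_overlap(t1.writes, t1.n_writes, t2.reads,  t2.n_reads)
        || any_overlap(t1.reads,  t1.n_reads,  t2.writes, t2.n_writes);
}
```

For the slide's example, Thread 2's read of A[5] falls inside Thread 1's write range 0 to 5 on A, so the check reports a conflict. Tracking only min and max is conservative: Thread 1 never touched A[1] through A[4], yet a read of any of them by Thread 2 would also be flagged.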

Object Array Conflict Detection
- More involved than scalar arrays
- Different indices of the same array may point to the same object
[Figure labels: arrays A (&A) and B (&B); objects myObj0 and myObj1, each with a header holding ptr and RefCnt fields; reference-count values 1 and 2; annotation "If dependent based on data-flow analysis".]

Loop Selection
- Focus on DOALL-counted loops (e.g. for loops)
- Avoid parallelizing loops with:
  - Browser interactions (requires locks on browser internals)
  - HTTP request functions (requires server-side speculation)
  - Runtime code insertion

Runtime code insertion examples:

    var addFunction = new Function("a", "b", "return a+b;");

is equivalent to:

    function addFunction(a, b) { return a+b; }

and

    eval("a = 7; b = 13; document.write(a+b);");

is equivalent to:

    a = 7; b = 13; document.write(a+b);

Checkpointing Mechanism
- Go through all references, clone them, and ask GC not to touch the clones
- Monitor overhead, back out if more expensive than a threshold

    Type                              Cloning process
    Strings, numbers and Booleans     Copy the values
    Custom objects                    Deep-copy all properties, avoid recursion
    Functions                         Same as custom objects; no need to clone source code
    Arrays                            Clone all properties and elements

Checkpointing Optimizations
- Selective variable cloning
  - Only clone a variable if it is touched during speculative execution
- Array clone elimination
  - Large arrays holding results of browser functions
  - Instead of cloning the array, just call the function again for recovery
    - E.g. getImageData in the canvas HTML5 element
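The single-checkpoint scheme (clone, run speculatively, restore and fall back to sequential execution on abort) can be illustrated with a small C sketch. It stands in for the JavaScript runtime with a plain int array, and every name in it (run_with_checkpoint, failing_increment, increment_all) is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Clone data, run the speculative version, and on a reported conflict restore
   the clone and run the sequential version instead.
   Returns 1 if speculation committed, 0 if it was rolled back,
   -1 if the checkpoint could not be allocated. */
int run_with_checkpoint(int *data, size_t n,
                        int (*speculative)(int *, size_t),
                        void (*sequential)(int *, size_t)) {
    int *clone = NULL;
    if (n > 0) {
        clone = malloc(n * sizeof *clone);
        if (clone == NULL) return -1;
        memcpy(clone, data, n * sizeof *data);      /* take the single checkpoint */
    }
    int committed = (speculative(data, n) == 0);    /* 0 means no conflict */
    if (!committed) {
        if (n > 0) memcpy(data, clone, n * sizeof *data);  /* roll back */
        sequential(data, n);                        /* re-execute sequentially */
    }
    free(clone);
    return committed;
}

/* Example speculative run that corrupts half the data, then reports a conflict. */
int failing_increment(int *d, size_t n) {
    for (size_t i = 0; i < n / 2; ++i) d[i] += 100;
    return 1;
}

/* Example sequential fallback. */
void increment_all(int *d, size_t n) {
    for (size_t i = 0; i < n; ++i) d[i] += 1;
}
```

Selective cloning would copy only the elements the speculative code can touch, and array clone elimination would skip the copy entirely when the data can be regenerated by calling its producer again.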

Experimental Setup
- Implemented in Firefox 3.7a1pre
- Subset of SunSpider benchmark suite
  - Others identified as not parallelizable early on, causing 2% slow-down due to the initial analysis
- A set of Pixastic Image Processing filters
- 8-processor system: 2 Intel Xeon quad-cores, running Ubuntu 9.10
- Ran each benchmark 10 times and took the average

Parallelism Coverage
- High fraction of sequential execution in the getImageData() browser function
- DOALL loop that extracts pixel RGB & alpha values

SunSpider
- A long iteration dominates execution

Pixastic Image Processing
[Chart legend: 1 thread, 2 threads, 4 threads, 8 threads]
- High memory op to computation ratio

