Download presentation
Presentation is loading. Please wait.
Published byMarilyn Perry Modified over 8 years ago
1
EECS 583 – Class 18 Research Topic 1 Breaking Dependences, Dynamic Parallelization University of Michigan November 14, 2012
2
- 1 - Announcements & Reading Material v Exam next Monday, Nov 19, 10:40-12:30 »Check room assignments on course website v No class next Wednes (Nov 21) »Sleep in and recover from the exam! v Today’s class reading »“ Spice: Speculative Parallel Iteration Chunk Execution,” E. Raman, N. Vachharajani, R. Rangan, and D. I. August, Proc 2008 Intl. Symposium on Code Generation and Optimization, April 2008. v Next class reading (Monday, Nov 26) »“Exploiting coarse-grained task, data, and pipeline parallelism in stream programs,” M. Gordon, W. Thies, and S. Amarasinghe, Proc. of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, Oct. 2006.
3
- 2 - The Data Dependence Problem while (ptr != NULL) {... ptr = ptr->next; sum = sum + foo; } How to deal with control dependences? How to deal with linked data structures? How to remove programmatic dependences?
4
- 3 - sum2 += x sum1 += x We Know How to Break Some of These Dependences – Recall ILP Optimizations sum+=x sum = sum1 + sum2 Thread 1 Thread 0 Apply accumulator variable expansion!
5
- 4 - DOALL Coverage – Provable and Profiled Still not good enough! Few dependences hinder parallelization in many loops
6
- 5 - 1: while (node) { 2: work(node); 3: node = node->next; } Speculative Loop Fission 1: while (node) { 4: node_array[count++] = node; 3: node = node->next; } XBEGIN 5: node = node_array[IS]; i = 0; 1 ' :while (node && i++ < CS) { 2: work(node); 3 ' : node = node->next; } RECV(THREAD j-1 ) XCOMMIT if (node!= node_array[IS+CS]){ update_node_array; kill_other_threads();} SEND(THREAD j+1 ) } XBEGIN 5: node = node_array[IS]; i = 0; 1 ' :while (node && i++ < CS) { 2: work(node); 3 ' : node = node->next; } RECV(THREAD j-1 ) XCOMMIT SEND(THREAD j+1 ) } If this were traversing an array, it would be a DOALL loop Separate out data structure access and work Parallelize work Sequential part Parallel part
7
- 6 - 1: while (node) { 2: work(node); 3: node = node->next; } Execution of Fissed Loop 1: while (node) { 4: node_array[count++] = node; 3: node = node->next; } XBEGIN 5: node = node_array[IS]; i = 0; 1 ' :while (node && i++ < CS) { 2: work(node); 3 ' : node = node->next; } RECV(THREAD j-1 ) XCOMMIT if (node!= node_array[IS+CS]){ update_node_array; kill_other_threads();} SEND(THREAD j+1 ) } XBEGIN 5: node = node_array[IS]; i = 0; 1 ' :while (node && i++ < CS) { 2: work(node); 3 ' : node = node->next; } RECV(THREAD j-1 ) XCOMMIT SEND(THREAD j+1 ) }
8
- 7 - Infrequent Dependence Isolation 1: 2: 1: 2: 99% 1% break A B C A B C’ C 1% 99%
9
- 8 - Infrequent Dependence Isolation - cont for( j=0; j<=nstate; ++j ){ if( tystate[j] == 0 ) continue; if( tystate[j] == best ) continue; count = 0; cbest = tystate[j]; for (k=j; k<=nstate; ++k) if (tystate[k]==cbest) ++count; if ( count > times) { best = cbest; times = count; } if ( count > times) { best = cbest; times = count; } j=0; while (j<=nstate){ for( ; j<=nstate; ++j ){ if( tystate[j] == 0 ) continue; if( tystate[j] == best ) continue; count = 0; cbest = tystate[j]; for (k=j; k<=nstate; ++k) if (tystate[k]==cbest) ++count; if ( count > times) break; } if (count > times) { best = cbest; times = count; j++; } } 1 % Sample loop from yacc benchmark
10
- 9 - Speculative Prematerialization v Iterations are sequential due to distance 1 recurrence of the red variables v Decouple iteration chunks by prematerializing last outside the loop
11
- 10 - DOALL Coverage – Profiled and Transformed
12
- 11 - Coverage Breakdown
13
- 12 - Improving DSWP with Value Speculation Motivating Example l = create_list(); while (...){ process_list(l); modify_list(l); } while (c != null){ c1 = c; c = c->next_cl; int w = c1->pick_weight; if (w < wm){ wm = w; cm = c1; }
14
- 13 - Need to Break Cross-thread Dependences Thread 1 Thread 2 value predict c Thread 1 Thread 2 Synchronization Value Speculation 1 2 4 6 53 1 2 4 6 53 value predict c
15
- 14 - Value Speculation 0x1000x1800x1200x2000x2200x300 value predict c Thread 1 Thread 2 1 2 3 4 42 6 5
16
- 15 - Value Speculation for TLP: Observation 1 v Predicting a value in a particular iteration is harder than predicting a value of the operation within a range of iterations 12300 April 15April 30
17
- 16 - Value Speculation for TLP: Observation 2 v It is sufficient to predict the values in a few iterations to get TLP 0x1000x1800x1200x2000x2200x300 value predict c = 0x200 Thread 1 Thread 2 1 4 5 6 32
18
- 17 - Memoize and Speculative Parallel Execution Core 1 Core 2 1 1 23456 23 4 5 6 SVA Memory
19
- 18 - Speculative Parallel Execution Core 1 Core 2 1 1 23456 2 34 5 6 7 7 Memory
20
- 19 - Mis-speculation Core 1 Core 2 1 1 23456 23 4 5 656 Memory
21
- 20 - Memoizing Once: Problems – Delete One of the Starting Nodes 123456
22
- 21 - 123456 789101112 Memoizing Once: Problems – The List Grows after the Initial Traversal
23
- 22 - Discussion Point 1 – Fission vs Value Speculation v Fission vs value speculation »When is fission better? »When is value speculation better? »Is either more powerful than the other? »Which would you use in your compiler and why? »Would you be confident your code would run correctly if these transformations were done? v Memoization/caching is a powerful optimization »Where is this used today? Can it be used more often?
24
- 23 - Discussion Point 2 – Dependences v Are these “tricks” valid for other common data structures? »Are any generalizations needed? v What other common data dependences are we missing in loops?
25
Dynamic Parallelization – Is this Practical?
26
- 25 - Client-side Computation in JavaScript v Flexibility, ease of prototyping, and portability 25 Poor performance is one of the main challenges
27
- 26 - JavaScript Parallelization v A typical static parallelization flow 26 Source code Memory dependence analysis Parallel code generation Parallel execution Compile time Runtime Memory profiling Data flow analysis Runtime Dynamic parallelization $ $ Speculation engine (Software transactional memory) $
28
- 27 - ParaScript Approach v Light-weight dynamic analysis & code generation for speculative DOALL loops v Low-cost customized SW speculation with a single checkpoint 27 Hot loop detection Initial parallelizabilit y assessment Parallel Code generation Parallel execution Sequential execution Loop Selection Runtime Customize d speculation Finish Abort
29
- 28 - Scalar Array Conflict Detection v Initial assessment catches trivial conflicts v Keep track of max and min accessed element indices v Cross-check RD/WR sets after thread execution 28 A[0] = … A[5] = … B[7] = … A[6] = A[5]+1 Thread 1 Thread 2 &A05 Array write-set &A66 Array write-set &B77 &A55 Array read-set ptr min max
30
- 29 - Object Array Conflict Detection v More involved than scalar arrays v Different indices of the same array may point to the same object 29 ptr RefCnt myObj0 header ptr RefCnt myObj1 header A &A 1 1 B &B1 2 If dependent based on data-flow analysis If dependent based on data-flow analysis
31
- 30 - Experimental Setup v Implemented in Firefox 3.7a1pre v Subset of SunSpider benchmark suite »Others identified as not parallelizable early on, causing 2% slow-down due to the initial analysis. v A set of Pixastic Image Processing filters v 8-processor system -- 2 Intel Xeon Quad-cores, running Ubuntu 9.10 v Ran each benchmark 10 times and took the average 30
32
- 31 - Parallelism Coverage 31 High fraction of sequential execution in the getImageData() browser function DOALL loop that extracts pixel RGB & alpha values High fraction of sequential execution in the getImageData() browser function DOALL loop that extracts pixel RGB & alpha values
33
- 32 - SunSpider 32 A long iteration dominates execution
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.