Toward a Compiler Framework for Thread-Level Speculation Marcelo Cintra University of Edinburgh http://www.homepages.inf.ed.ac.uk/mc
University of Karlsruhe - January 2006

Why Speculation?
Performance of programs is ultimately limited by control and data flow
Most compiler and architectural optimizations exploit knowledge of control and data flow
Techniques based on complete knowledge of control and data flow are reaching their limits
Future compiler and architectural optimizations must rely on incomplete knowledge: speculative execution
Example: Loop Fusion

Original code:
  for (i=0; i<100; i++) {
    A[i] = …
  }
  for (i=0; i<100; i++) {
    … = A[i] + …
    if (cond) A[B[i]] = …
  }

Optimized (fused) code:
  for (i=0; i<100; i++) {
    A[i] = …
    … = A[i] + …
    if (cond) A[B[i]] = …
  }

The indirect store makes fusion unsafe: B[i] > i ?? If some guarded store targets a later element, the fused loop accesses that element in a different order than the original loops, and the optimization is incorrect.
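The legality condition above can be tested at run time. The following is an illustrative sketch (not from the slides); the index array B, the guard cond, and the helper name can_fuse are hypothetical stand-ins:

```python
def can_fuse(B, cond):
    """Fusion is safe if no guarded store A[B[i]] targets a *later*
    iteration's element, i.e. B[i] <= i whenever the store executes."""
    return all(B[i] <= i for i in range(len(B)) if cond(i))

# A forward-pointing index (B[2] = 5 > 2) makes fusion unsafe:
assert can_fuse([0, 1, 5, 3], lambda i: True) is False
assert can_fuse([0, 1, 2, 3], lambda i: True) is True
```

A compiler cannot usually evaluate this test statically, which is exactly why the access must be executed speculatively and checked at run time.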
Example: Out-of-order Execution

Original execution:
  MUL R1, R2, R3
  stall … stall
  ADD R5, R1, R4
  ST  1000(R1), R5
  stall … stall
  LD  500(R7), R6
  stall … stall

Optimized execution (load hoisted above the store):
  MUL R1, R2, R3
  LD  500(R7), R6
  stall … stall
  ADD R5, R1, R4
  ST  1000(R1), R5
  stall … stall

Hoisting the load is unsafe: 500+R7 == 1000+R1 ??
Solution: Speculative Execution

Identify potential optimization opportunities
Assume no data dependences and perform the optimization
While speculating, buffer unsafe data separately
Monitor actual data accesses at run time
Detect violations
Squash the offending execution, discard speculative data, and re-execute
  or, commit the speculative execution and data
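The steps above can be sketched as a minimal buffer-and-check cycle. This is an assumed illustration, not the paper's implementation; the function name, the thread encoding, and the conflict test are all made up:

```python
def run_speculative(memory, thread, conflicts_with_predecessor):
    """Execute one speculative thread: buffer stores separately,
    then either squash (discard the buffer) or commit it to memory."""
    spec_buffer = {}                      # unsafe data buffered separately
    for op, addr, value in thread:        # monitor actual accesses at run time
        if op == "store":
            spec_buffer[addr] = value     # never overwrite safe state
        elif op == "load":
            _ = spec_buffer.get(addr, memory.get(addr))
    if conflicts_with_predecessor(spec_buffer):
        return "squash", {}               # discard speculative data, re-execute
    memory.update(spec_buffer)            # commit speculative data
    return "commit", spec_buffer

mem = {0: 10}
status, _ = run_speculative(mem, [("load", 0, None), ("store", 0, 99)],
                            lambda buf: False)
assert status == "commit" and mem[0] == 99
```

Note that safe memory is only touched on commit; a squash simply drops the buffer, which is what makes re-execution cheap to express (if not cheap in time).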
Why Speculation at Thread Level?

Modern architectures support instruction-level speculation, but:
–depth of speculative execution only spans a few dozen instructions (the instruction window)
–no support for speculative memory operations (especially stores)
–speculation is not exposed to the compiler
Must support speculative execution across much larger blocks of instructions (“threads”) and with compiler assistance
Outline

Motivation
Speculative Parallelization
Compiler Cost Model
–Evaluation
–Related Work
–Conclusions
Current and Future Directions
Speculative Parallelization

Assume no dependences and execute threads in parallel
Track data accesses at run time
Detect cross-thread violations
Squash offending threads and restart them

  for (i=0; i<100; i++) {
    … = A[L[i]] + …
    A[K[i]] = …
  }

Iteration J:    … = A[4]+…   A[5] = …
Iteration J+1:  … = A[2]+…   A[2] = …
Iteration J+2:  … = A[5]+…   A[6] = …

RAW violation: iteration J’s store to A[5] conflicts with iteration J+2’s earlier load of A[5]
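The run-time tracking above can be sketched as follows. This is an illustrative check (the index arrays mirror the slide's L[] and K[], but the function and its encoding are assumptions): a store by an earlier iteration to an element that a strictly later iteration has already loaded is a RAW violation, and the later iteration is squashed.

```python
def find_raw_violations(L, K, start=0):
    """L[i]: element loaded by iteration start+i; K[i]: element stored.
    Returns the set of iterations that must be squashed and restarted."""
    squashed = set()
    for i, st in enumerate(K):           # store by iteration start+i ...
        for j, ld in enumerate(L):
            if j > i and ld == st:       # ... read by a *later* iteration
                squashed.add(start + j)  # that may already have executed
    return squashed

# The slide's example: loads A[4], A[2], A[5]; stores A[5], A[2], A[6].
# Iteration J's store to A[5] squashes iteration J+2:
assert find_raw_violations([4, 2, 5], [5, 2, 6]) == {2}
```

Iteration J+1's load and store of A[2] do not trigger a squash: same-iteration accesses are not cross-thread dependences.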
Speculative Parallelization Overheads

Squash & restart: re-executing the squashed threads
Speculative buffer overflow: the speculative buffer is full; the thread stalls until it becomes non-speculative
Inter-thread communication: waiting for a value from a predecessor thread
Dispatch & commit: writing speculative data back into memory
Load imbalance: a processor waits for its thread to become non-speculative in order to commit
Squash Overhead

(figure: threads 1–8 running on PE0–PE3; a store in an earlier thread after a “later” thread’s load triggers a squash)

A particular problem in speculative parallelization
Data dependences cannot be violated
A store appearing after a “later” load (in sequential order) causes a squash
A squashed thread must restart from the beginning
Speculative Buffer Overflow Overhead

(figure: threads 1–8 on PE0–PE3; a thread whose speculative buffer overflows stalls on its store until its predecessor commits)

A particular problem in speculative parallelization
Speculatively modified state cannot be allowed to overwrite safe (non-speculative) state; it must be buffered instead
On buffer overflow, the thread remains idle waiting for its predecessor to commit
Load Imbalance Overhead

(figure: threads 1–8 on PE0–PE3; processors sit idle between finishing a thread and committing it in order)

A different problem in speculative parallelization
Due to the in-order-commit requirement, a processor cannot start a new thread before its current thread commits
The processor remains idle waiting for the predecessor to commit
Factors Causing Load Imbalance

Difference in thread workload:
–different control paths (intrinsic load imbalance)
–different data sizes
–influence from other overheads, e.g. speculative buffer overflow on one thread leads to longer waiting times on successor threads

  for () {
    if () {
      …          // Workload 1 (W1)
    } else {
      …          // Workload 2 (W2)
    }
  }
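The cost of the in-order-commit requirement can be sketched with a tiny timing model (an assumption for illustration: all threads start together and commit is instantaneous, so a thread's commit time is the running maximum of finish times):

```python
def commit_times(thread_sizes):
    """One thread per processor, committed in sequential order:
    a thread cannot commit before its predecessor has committed."""
    times, t_prev = [], 0
    for w in thread_sizes:        # w: the thread's own execution time
        t_prev = max(w, t_prev)   # wait for the predecessor's commit
        times.append(t_prev)
    return times

# One long thread delays every successor's commit (load imbalance):
assert commit_times([4, 1, 1, 1]) == [4, 4, 4, 4]
# The same workloads in a different order cause no waiting:
assert commit_times([1, 1, 1, 4]) == [1, 1, 1, 4]
```

The two assertions show why the *assignment* of thread sizes to processors matters, which is exactly what the next slide and the tuple model capture.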
Factors Causing Load Imbalance (contd.)

Assignment (locations) of the threads on the processors
(figure: the same threads assigned to PE0–PE3 in two different ways, producing different commit and idle patterns)
Outline

Motivation
Speculative Parallelization
Compiler Cost Model
–Evaluation
–Related Work
–Conclusions
Current and Future Directions
Why a Compiler Cost Model?

Speculative parallelization can deliver significant speedup or slowdown:
–several speculation overheads
–some code segments could slow down the program
–we need a smart compiler that chooses which program regions to run speculatively based on the expected outcome
A prediction of the value of the speedup can also be useful:
–e.g. in a multi-tasking environment: program A wants to run speculatively in parallel (predicted speedup 1.2), other programs are waiting to be scheduled, and the OS decides it does not pay off
Proposed Compiler Model

Idea (extended from [Dou and Cintra, PACT’04]):
1. Compute a table of thread sizes based on all possible execution paths (base thread sizes)
2. Generate new thread sizes for those execution paths that have speculative overheads (overheaded thread sizes)
3. Consider all possible assignments of the above sizes to P processors, each weighted by its probability
4. Remove improper assignments and adjust probabilities
5. Compute the expected sequential (Tseq_est) and parallel (Tpar_est) execution times
6. S_est = Tseq_est / Tpar_est
1. Compute Thread Sizes Based on Execution Paths

  for () {
    …                 // workload w1 (common code)
    if () {
      …
      … = X[A[i]]     // ld 1
      …
      X[B[i]] = …     // st 1
      …               // branch workload w2
    } else {
      …
      Y[C[i]] = …     // st 2
      …               // branch workload w3
    }
  }

Path through the “if” branch: thread size W1 = w1 + w2, probability p1
Path through the “else” branch: thread size W2 = w1 + w3, probability p2
2. Generating New Thread Sizes for Speculative Overheads

For every execution path that can incur a speculative overhead, generate an additional “overheaded” thread size: the base size plus the overhead workload w (e.g., a thread of size W1 containing ld 1 that is squashed and re-executed appears as a new size W3 = W1 + w)
Probabilities are adjusted accordingly: the original size W1 now occurs with probability p1’, and the new size W3 with probability p3; unaffected sizes such as W2 keep their probability p2
3. Consider All Assignments: the Thread Tuple Model

(figure: example assignments of thread sizes 1, 2, and 3 to processors PE0–PE3, ranging from all processors holding size 1 to mixes such as 1112 and 1213 to all processors holding size 3)
3. Consider All Assignments: the Thread Tuple Model (contd.)

Three thread sizes W1, W2, and W3, assigned onto 4 processors:
81 variations, each called a tuple
In general: N thread sizes and P processors give N^P tuples

Tuple  Assignment  Probability
1      1111        p1.p1.p1.p1
2      1112        p1.p1.p1.p2
3      1113        p1.p1.p1.p3
…      …           …
80     3332        p3.p3.p3.p2
81     3333        p3.p3.p3.p3
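The tuple model can be enumerated directly for small N and P. The sketch below uses made-up sizes and probabilities; it checks the N^P count and that, with independent assignments, the tuple probabilities sum to 1, and it computes the per-tuple times used in step 5:

```python
from itertools import product

sizes = {1: 10.0, 2: 14.0, 3: 20.0}   # W1, W2, W3 (hypothetical values)
probs = {1: 0.5,  2: 0.3,  3: 0.2}    # p1, p2, p3 (hypothetical values)
P = 4                                  # number of processors

tuples = list(product(sizes, repeat=P))
assert len(tuples) == 3 ** P           # 81 tuples for N=3, P=4

def prob(t):                           # e.g. (1, 1, 1, 2) -> p1.p1.p1.p2
    q = 1.0
    for i in t:
        q *= probs[i]
    return q

assert abs(sum(prob(t) for t in tuples) - 1.0) < 1e-12

def times(t):                          # (Tseq_tuple, Tpar_tuple) for a tuple
    w = [sizes[i] for i in t]
    return sum(w), max(w)

assert times((1, 1, 1, 2)) == (44.0, 14.0)   # 3.W1 + W2, max = W2
```

This brute-force enumeration is exponential in P; the closed-form computation in step 5 exists precisely to avoid it.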
4. Remove Improper Assignments and Adjust Probabilities

Some assignments can never happen:
–e.g., squashed and overflowed threads cannot appear in PE0
–e.g., a squashed thread can only appear if the “producer” thread appears on a predecessor processor
–e.g., an overflowed thread can only appear if a thread larger than the time of the stalling store appears on a predecessor processor
Probabilities vary across processors:
–e.g., the probability of a squashed thread appearing increases from PE1 to PE(P-1) (increased chance that the producer appears on a predecessor processor)
4. Remove Improper Assignments and Adjust Probabilities (contd.)

Tuple  Assignment  Probability
1      1111        p1,0.p1,1.p1,2.p1,3
2      1112        p1,0.p1,1.p1,2.p2,3
3      1113        p1,0.p1,1.p1,2.p3,3
…      …           …
80     3332        (cannot appear)
81     3333        (cannot appear)

Per-processor probabilities p_i,j replace the p_i’s; improper tuples are removed, and the probabilities of the remaining tuples add up to 1
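A simplified sketch of step 4, under an assumed rule: let size 3 be a "squashed" size, which can never appear on PE0 (the non-speculative processor). Improper tuples are dropped and the remaining mass rescaled so the probabilities again sum to 1. (The paper's model instead computes per-processor probabilities p_i,j; the plain renormalization here is a stand-in for illustration.)

```python
from itertools import product

probs = {1: 0.5, 2: 0.3, 3: 0.2}       # hypothetical p1, p2, p3
P = 4

# Drop every tuple that places the squashed size (3) on PE0:
tuples = [t for t in product(probs, repeat=P) if t[0] != 3]

def prob(t):
    q = 1.0
    for i in t:
        q *= probs[i]
    return q

total = sum(prob(t) for t in tuples)   # probability mass of proper tuples
weights = {t: prob(t) / total for t in tuples}

assert all(t[0] != 3 for t in weights)            # no improper tuple left
assert abs(sum(weights.values()) - 1.0) < 1e-12   # adds up to 1 again
```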
5. Compute Sequential and Parallel Execution Times

Within a tuple:
Tseq_tuple = Σ (over threads i in the tuple) Wi
Tpar_tuple = max (over threads i in the tuple) Wi

Tuple  Assignment  Probability       Tseq_tuple  Tpar_tuple
1      1111        p1.p1.p1.p1       4.W1        W1
2      1112        p1.p1.p1.p2       3.W1 + W2   W2
3      1113        p1.p1.p1.p3       3.W1 + W3   W3
…      …           …                 …           …
80     3332        p3.p3.p3.p2       3.W3 + W2   W3
81     3333        p3.p3.p3.p3       4.W3        W3

(assuming W1 < W2 < W3)
5. Compute Sequential and Parallel Execution Times (contd.)

Tseq_est (estimated sequential execution time): the probability-weighted sum of Tseq_tuple over all tuples
Tpar_est (estimated parallel execution time): the probability-weighted sum of Tpar_tuple over all tuples
5. Compute Sequential and Parallel Execution Times (contd.)

Estimated sequential execution time:
  Tseq_est = P . Σ (i=1..NB) Wi.pi
  O(NB) << enumeration (NB: number of base thread sizes)

Estimated parallel execution time:
  Tpar_est = Σ (i=1..N) p(Tpar_tuple = Wi) . Wi
where:
  p(Tpar_tuple = Wi) = Π (k=0..P-1) ( Σ (l=1..i) p_l,k ) − Σ (m=1..i-1) p(Tpar_tuple = Wm)
  O(N.P + O(p_i,j)), where O(p_i,j) is the complexity of computing the p_i,j’s
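Steps 5 and 6 can be sketched and cross-checked against brute-force enumeration. The probabilities and sizes below are made up, and the per-processor probabilities are simplified to p_i,j = p_i (uniform across processors); with sizes sorted increasingly, the probability that the tuple maximum equals W[i] is the probability that every processor holds a size at most W[i], minus the probabilities of all smaller maxima:

```python
from itertools import product

W = [10.0, 14.0, 20.0]                 # W1 < W2 < W3 (hypothetical)
p = [0.5, 0.3, 0.2]                    # p1, p2, p3 (hypothetical)
P = 4                                  # number of processors

def p_max(i):
    """p(Tpar_tuple == W[i]) via the closed form, no tuple enumeration."""
    cdf = sum(p[:i + 1]) ** P          # all P processors hold a size <= W[i]
    return cdf - sum(p_max(m) for m in range(i))

Tpar_est = sum(p_max(i) * W[i] for i in range(len(W)))
Tseq_est = P * sum(w * q for w, q in zip(W, p))   # P . sum(Wi.pi)
S_est = Tseq_est / Tpar_est                       # step 6: estimated speedup

# Brute-force check over all N**P tuples:
brute = 0.0
for t in product(range(len(W)), repeat=P):
    q = 1.0
    for i in t:
        q *= p[i]
    brute += q * max(W[i] for i in t)
assert abs(Tpar_est - brute) < 1e-9
assert S_est > 1.0                     # speculation predicted to pay off here
```

The closed form touches each size once per processor instead of visiting N^P tuples, which is the source of the O(N.P) term in the complexity above.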
5. Compute Sequential and Parallel Execution Times (contd.)

p_i,j is the probability that thread i appears on processor j; it either equals p_i or is computed for every pair of threads involved in an overhead
Each p_i,j is computed in:
–O(1) for the squash overhead
–O(NB) for the overflow overhead
Thus all p_i,j’s are computed in O(NB.N.P) = O(N².P)
Thus all p(Tpar_tuple = Wi) are computed in O(N².P)
Finally, Tpar_est is computed in O(N².P) << enumeration
6. Computing the Estimated Speedup

S_est = Tseq_est / Tpar_est
O(N².P) << enumeration (compare with the O(N) PACT’04 model)
Outline

Motivation
Speculative Parallelization
Compiler Cost Model
–Evaluation
–Related Work
–Conclusions
Current and Future Directions
Evaluation Environment

Implementation: IR of SUIF1
–high-level control structure retained
–instructions within basic blocks dismantled
Simulation: trace-driven with Simics
Architecture: Stanford Hydra CMP
–4 single-issue processors
–private 16KB L1 caches
–private fully associative 2KB speculative buffers
–shared on-chip 2MB L2 cache
Applications

Subset of the SPEC2000 benchmarks:
–4 floating point and 5 integer
MinneSPEC reduced input sets:
–input size: 2 to 3 billion instructions
–simulated instructions: 100 to 600 million
Focus on load imbalance and squash overheads:
–none of the loops suffered from overflow
Total of 190 loops:
–collectively account for about 50% to 100% of the sequential execution time of most applications
Speedup Distribution

Very varied speedup/slowdown behavior across loops
Model Accuracy (I): Outcomes

Only 23% false positives (performance degradation)
Negligible false negatives (missed opportunities)
Most speedups/slowdowns are correctly predicted by the model
Model Accuracy (II): Cumulative Errors Distribution

Error less than 50% for 77% of the loops
Acceptable errors, but room for improvement
Model Accuracy (II): Cumulative Errors Distribution
Performance Improvements

Mostly better performance than previous policies
Close to the performance of an oracle
Can curb the performance degradation of the naive policy
Outline

Motivation
Speculative Parallelization
Compiler Cost Model
–Evaluation
–Related Work
–Conclusions
Current and Future Directions
Related Work

Architectures supporting speculative parallelization:
–Multiscalar processor (Wisconsin)
–Hydra (Stanford)
–Clustered Speculative Multithreaded processor (UPC)
–Thread-Level Data Speculation (CMU)
–MAJC (Sun)
–Superthreaded processor (Minnesota)
–Multiplex (Purdue)
–CMP with speculative multithreading (Illinois)
Related Work

Compiler support for speculative parallelization:
–most of the above projects have a compiler branch
–thread partitioning and optimizations based on simple heuristics and/or profiling
Recent publications on compiler cost models:
–Chen et al. (PPoPP’03): a mathematical model, concentrated on probabilistic points-to analysis
–Du et al. (PLDI’04): a cost model of squash overhead based on the probability of dependences; only intended for a CMP with 2 processors
No literature found on a cost model that includes load imbalance and the other overheads, and that targets several processors
Conclusions

Compiler cost model of speculative multithreaded execution
Fairly accurate quantitative predictions of speedup:
–correctly identifies speedup/slowdown in 73% of cases
–errors of less than 50% in 77% of the cases
Good model-driven selection policy:
–usually faster than other policies and within 11% of an oracle
–can curb the performance degradation of the naive policy
Can accommodate all other speculative execution overheads:
–accuracy not as high as the PACT’04 results, but still good for a static scheme
Outline

Motivation
Speculative Parallelization
Compiler Cost Model
–Evaluation
–Related Work
–Conclusions
Current and Future Directions
Current and Future Directions

Software-only speculative parallelization:
–speculatively parallelized the Convex Hull problem (CG) (CGA’04)
–parallelizing the Minimum Enclosing Circle (CG) and Simultaneous Multiple Sequence Alignment (Bioinformatics) problems
–extending the scheme to reduce/eliminate overheads
Complete compiler model of overheads:
–use probabilistic memory disambiguation analysis to factor the squash overhead into the model
–use probabilistic cache miss models to factor the speculative buffer overflow overhead into the model
Current and Future Directions

Probabilistic memory disambiguation:
–extend current points-to, alias, and data flow analyses to generate the probability of occurrence of these relations
–necessary infrastructure for all quantitative cost models for data speculation
Other speculative multithreading models:
–combining speculative parallelization with speculative helper threads
Current and Future Directions

Speculative compiler optimizations:
–perform traditional compiler optimizations in the presence of potential data flow relations (e.g., loop distribution/fusion, hoisting, hyperblock instruction scheduling)
–use spare contexts in SMT (Hyperthreading) processors to run/verify speculative optimizations (“helper threads”)
–add TLS support for deep speculation
Acknowledgments

Research team and collaborators:
–Prof. Diego Llanos (University of Valladolid, Spain)
–Prof. Belen Palop (University of Valladolid, Spain)
–Jialin Dou
–Constantino Ribeiro
–Salman Khan
–Syamsul Bin Hussin
Funding:
–UK – EPSRC
–EC – TRACS
–EC – HPC Europa
Squashing

(figure: timeline of producer thread i and consumer threads i+j+1 and i+j+2; the producer’s Wr arriving after the consumer’s Rd squashes the consumers, splitting their execution into useful work, possibly correct work, wasted correct work, and squash overhead)

Squashing is very costly
5. Compute Sequential and Parallel Execution Times (contd.)

Each p_i,j is computed as follows.

Squash overheads:
p_i,j = p_i, for the original size and j = 0
p_i,j = (1 − p_producer)^j . p_base + (1 − (1 − p_producer)^j) . p_base . (1 − p_dep), for the original size and j > 0
p_i,j = (1 − (1 − p_producer)^j) . p_base . p_dep, for the squashed size

Overflow overheads:
p_i,j = p_i . (1 − p_ovflow) + p_i . p_ovflow . (1 − Σ (k longer) p_k)^j, for the original size and j ≠ 0
p_i,j = (( Σ (k=1..wait) p_k )^j − ( Σ (k=1..wait-1) p_k )^j) . p_ovflow . p_base, for the overflowed size
Model Accuracy (III): Squash Prediction
Sources of Largest Errors (top 10%)

Source of error                        Number  Error (%)
Incorrect IR workload estimation       4       54~116
Unknown iteration count (i<P)          3       54~61
Unknown inner-loop iteration count     2       98~161
Biased conditional                     1       136