Toward a Compiler Framework for Thread-Level Speculation Marcelo Cintra University of Edinburgh http://www.homepages.inf.ed.ac.uk/mc
University of Karlsruhe - January 2006

Why Speculation?
Performance of programs is ultimately limited by control and data flow
Most compiler and architectural optimizations exploit knowledge of control and data flow
Techniques based on complete knowledge of control and data flow are reaching their limits
Future compiler and architectural optimizations must rely on incomplete knowledge: speculative execution
Example: Loop Fusion

Original code:
  for (i=0; i<100; i++) {
    A[i] = …
  }
  for (i=0; i<100; i++) {
    … = A[i] + …
    if (cond) A[B[i]] = …
  }

Optimized (fused) code:
  for (i=0; i<100; i++) {
    A[i] = …
    … = A[i] + …
    if (cond) A[B[i]] = …
  }

The indirect store makes fusion unsafe: B[i] > i ?? If some guarded store targets a later element, the fused loop accesses that element in a different order than the original loops, and the optimization is incorrect.
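The legality condition above can be tested at run time. The following is an illustrative sketch (not from the slides); the index array B, the guard cond, and the helper name can_fuse are hypothetical stand-ins:

```python
def can_fuse(B, cond):
    """Fusion is safe if no guarded store A[B[i]] targets a *later*
    iteration's element, i.e. B[i] <= i whenever the store executes."""
    return all(B[i] <= i for i in range(len(B)) if cond(i))

# A forward-pointing index (B[2] = 5 > 2) makes fusion unsafe:
assert can_fuse([0, 1, 5, 3], lambda i: True) is False
assert can_fuse([0, 1, 2, 3], lambda i: True) is True
```

A compiler cannot usually evaluate this test statically, which is exactly why the access must be executed speculatively and checked at run time.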
Example: Out-of-order Execution

Original execution:
  MUL R1, R2, R3
  stall … stall
  ADD R5, R1, R4
  ST  1000(R1), R5
  stall … stall
  LD  500(R7), R6
  stall … stall

Optimized execution (load hoisted above the store):
  MUL R1, R2, R3
  LD  500(R7), R6
  stall … stall
  ADD R5, R1, R4
  ST  1000(R1), R5
  stall … stall

Hoisting the load is unsafe: 500+R7 == 1000+R1 ??
Solution: Speculative Execution

Identify potential optimization opportunities
Assume no data dependences and perform the optimization
While speculating, buffer unsafe data separately
Monitor actual data accesses at run time
Detect violations
Squash the offending execution, discard speculative data, and re-execute
  or, commit the speculative execution and data
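The steps above can be sketched as a minimal buffer-and-check cycle. This is an assumed illustration, not the paper's implementation; the function name, the thread encoding, and the conflict test are all made up:

```python
def run_speculative(memory, thread, conflicts_with_predecessor):
    """Execute one speculative thread: buffer stores separately,
    then either squash (discard the buffer) or commit it to memory."""
    spec_buffer = {}                      # unsafe data buffered separately
    for op, addr, value in thread:        # monitor actual accesses at run time
        if op == "store":
            spec_buffer[addr] = value     # never overwrite safe state
        elif op == "load":
            _ = spec_buffer.get(addr, memory.get(addr))
    if conflicts_with_predecessor(spec_buffer):
        return "squash", {}               # discard speculative data, re-execute
    memory.update(spec_buffer)            # commit speculative data
    return "commit", spec_buffer

mem = {0: 10}
status, _ = run_speculative(mem, [("load", 0, None), ("store", 0, 99)],
                            lambda buf: False)
assert status == "commit" and mem[0] == 99
```

Note that safe memory is only touched on commit; a squash simply drops the buffer, which is what makes re-execution cheap to express (if not cheap in time).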
Why Speculation at Thread Level?

Modern architectures support instruction-level speculation, but:
–depth of speculative execution only spans a few dozen instructions (the instruction window)
–no support for speculative memory operations (especially stores)
–speculation is not exposed to the compiler
Must support speculative execution across much larger blocks of instructions (“threads”) and with compiler assistance
Outline

Motivation
Speculative Parallelization
Compiler Cost Model
–Evaluation
–Related Work
–Conclusions
Current and Future Directions
Speculative Parallelization

Assume no dependences and execute threads in parallel
Track data accesses at run time
Detect cross-thread violations
Squash offending threads and restart them

  for (i=0; i<100; i++) {
    … = A[L[i]] + …
    A[K[i]] = …
  }

Iteration J:    … = A[4]+…   A[5] = …
Iteration J+1:  … = A[2]+…   A[2] = …
Iteration J+2:  … = A[5]+…   A[6] = …

RAW violation: iteration J’s store to A[5] conflicts with iteration J+2’s earlier load of A[5]
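The run-time tracking above can be sketched as follows. This is an illustrative check (the index arrays mirror the slide's L[] and K[], but the function and its encoding are assumptions): a store by an earlier iteration to an element that a strictly later iteration has already loaded is a RAW violation, and the later iteration is squashed.

```python
def find_raw_violations(L, K, start=0):
    """L[i]: element loaded by iteration start+i; K[i]: element stored.
    Returns the set of iterations that must be squashed and restarted."""
    squashed = set()
    for i, st in enumerate(K):           # store by iteration start+i ...
        for j, ld in enumerate(L):
            if j > i and ld == st:       # ... read by a *later* iteration
                squashed.add(start + j)  # that may already have executed
    return squashed

# The slide's example: loads A[4], A[2], A[5]; stores A[5], A[2], A[6].
# Iteration J's store to A[5] squashes iteration J+2:
assert find_raw_violations([4, 2, 5], [5, 2, 6]) == {2}
```

Iteration J+1's load and store of A[2] do not trigger a squash: same-iteration accesses are not cross-thread dependences.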
Speculative Parallelization Overheads

Squash & restart: re-executing the squashed threads
Speculative buffer overflow: the speculative buffer is full; the thread stalls until it becomes non-speculative
Inter-thread communication: waiting for a value from a predecessor thread
Dispatch & commit: writing speculative data back into memory
Load imbalance: a processor waits for its thread to become non-speculative in order to commit
Squash Overhead

(figure: threads 1–8 running on PE0–PE3; a store in an earlier thread after a “later” thread’s load triggers a squash)

A particular problem in speculative parallelization
Data dependences cannot be violated
A store appearing after a “later” load (in sequential order) causes a squash
A squashed thread must restart from the beginning
Speculative Buffer Overflow Overhead

(figure: threads 1–8 on PE0–PE3; a thread whose speculative buffer overflows stalls on its store until its predecessor commits)

A particular problem in speculative parallelization
Speculatively modified state cannot be allowed to overwrite safe (non-speculative) state; it must be buffered instead
On buffer overflow, the thread remains idle waiting for its predecessor to commit
Load Imbalance Overhead

(figure: threads 1–8 on PE0–PE3; processors sit idle between finishing a thread and committing it in order)

A different problem in speculative parallelization
Due to the in-order-commit requirement, a processor cannot start a new thread before its current thread commits
The processor remains idle waiting for the predecessor to commit
Factors Causing Load Imbalance

Difference in thread workload:
–different control paths (intrinsic load imbalance)
–different data sizes
–influence from other overheads, e.g. speculative buffer overflow on one thread leads to longer waiting times on successor threads

  for () {
    if () {
      …          // Workload 1 (W1)
    } else {
      …          // Workload 2 (W2)
    }
  }
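The cost of the in-order-commit requirement can be sketched with a tiny timing model (an assumption for illustration: all threads start together and commit is instantaneous, so a thread's commit time is the running maximum of finish times):

```python
def commit_times(thread_sizes):
    """One thread per processor, committed in sequential order:
    a thread cannot commit before its predecessor has committed."""
    times, t_prev = [], 0
    for w in thread_sizes:        # w: the thread's own execution time
        t_prev = max(w, t_prev)   # wait for the predecessor's commit
        times.append(t_prev)
    return times

# One long thread delays every successor's commit (load imbalance):
assert commit_times([4, 1, 1, 1]) == [4, 4, 4, 4]
# The same workloads in a different order cause no waiting:
assert commit_times([1, 1, 1, 4]) == [1, 1, 1, 4]
```

The two assertions show why the *assignment* of thread sizes to processors matters, which is exactly what the next slide and the tuple model capture.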
Factors Causing Load Imbalance (contd.)

Assignment (locations) of the threads on the processors
(figure: the same threads assigned to PE0–PE3 in two different ways, producing different commit and idle patterns)
Outline

Motivation
Speculative Parallelization
Compiler Cost Model
–Evaluation
–Related Work
–Conclusions
Current and Future Directions
Why a Compiler Cost Model?

Speculative parallelization can deliver significant speedup or slowdown:
–several speculation overheads
–some code segments could slow down the program
–we need a smart compiler that chooses which program regions to run speculatively based on the expected outcome
A prediction of the value of the speedup can also be useful:
–e.g. in a multi-tasking environment: program A wants to run speculatively in parallel (predicted speedup 1.2), other programs are waiting to be scheduled, and the OS decides it does not pay off
Proposed Compiler Model

Idea (extended from [Dou and Cintra, PACT’04]):
1. Compute a table of thread sizes based on all possible execution paths (base thread sizes)
2. Generate new thread sizes for those execution paths that have speculative overheads (overheaded thread sizes)
3. Consider all possible assignments of the above sizes to P processors, each weighted by its probability
4. Remove improper assignments and adjust probabilities
5. Compute the expected sequential (Tseq_est) and parallel (Tpar_est) execution times
6. S_est = Tseq_est / Tpar_est
1. Compute Thread Sizes Based on Execution Paths

  for () {
    …                 // workload w1 (common code)
    if () {
      …
      … = X[A[i]]     // ld 1
      …
      X[B[i]] = …     // st 1
      …               // branch workload w2
    } else {
      …
      Y[C[i]] = …     // st 2
      …               // branch workload w3
    }
  }

Path through the “if” branch: thread size W1 = w1 + w2, probability p1
Path through the “else” branch: thread size W2 = w1 + w3, probability p2
2. Generating New Thread Sizes for Speculative Overheads

For every execution path that can incur a speculative overhead, generate an additional “overheaded” thread size: the base size plus the overhead workload w (e.g., a thread of size W1 containing ld 1 that is squashed and re-executed appears as a new size W3 = W1 + w)
Probabilities are adjusted accordingly: the original size W1 now occurs with probability p1’, and the new size W3 with probability p3; unaffected sizes such as W2 keep their probability p2
3. Consider All Assignments: the Thread Tuple Model

(figure: example assignments of thread sizes 1, 2, and 3 to processors PE0–PE3, ranging from all processors holding size 1 to mixes such as 1112 and 1213 to all processors holding size 3)
3. Consider All Assignments: the Thread Tuple Model (contd.)

Three thread sizes W1, W2, and W3, assigned onto 4 processors:
81 variations, each called a tuple
In general: N thread sizes and P processors give N^P tuples

Tuple  Assignment  Probability
1      1111        p1.p1.p1.p1
2      1112        p1.p1.p1.p2
3      1113        p1.p1.p1.p3
…      …           …
80     3332        p3.p3.p3.p2
81     3333        p3.p3.p3.p3
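The tuple model can be enumerated directly for small N and P. The sketch below uses made-up sizes and probabilities; it checks the N^P count and that, with independent assignments, the tuple probabilities sum to 1, and it computes the per-tuple times used in step 5:

```python
from itertools import product

sizes = {1: 10.0, 2: 14.0, 3: 20.0}   # W1, W2, W3 (hypothetical values)
probs = {1: 0.5,  2: 0.3,  3: 0.2}    # p1, p2, p3 (hypothetical values)
P = 4                                  # number of processors

tuples = list(product(sizes, repeat=P))
assert len(tuples) == 3 ** P           # 81 tuples for N=3, P=4

def prob(t):                           # e.g. (1, 1, 1, 2) -> p1.p1.p1.p2
    q = 1.0
    for i in t:
        q *= probs[i]
    return q

assert abs(sum(prob(t) for t in tuples) - 1.0) < 1e-12

def times(t):                          # (Tseq_tuple, Tpar_tuple) for a tuple
    w = [sizes[i] for i in t]
    return sum(w), max(w)

assert times((1, 1, 1, 2)) == (44.0, 14.0)   # 3.W1 + W2, max = W2
```

This brute-force enumeration is exponential in P; the closed-form computation in step 5 exists precisely to avoid it.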
4. Remove Improper Assignments and Adjust Probabilities

Some assignments can never happen:
–e.g., squashed and overflowed threads cannot appear in PE0
–e.g., a squashed thread can only appear if the “producer” thread appears on a predecessor processor
–e.g., an overflowed thread can only appear if a thread larger than the time of the stalling store appears on a predecessor processor
Probabilities vary across processors:
–e.g., the probability of a squashed thread appearing increases from PE1 to PE(P-1) (increased chance that the producer appears on a predecessor processor)
4. Remove Improper Assignments and Adjust Probabilities (contd.)

Tuple  Assignment  Probability
1      1111        p1,0.p1,1.p1,2.p1,3
2      1112        p1,0.p1,1.p1,2.p2,3
3      1113        p1,0.p1,1.p1,2.p3,3
…      …           …
80     3332        (cannot appear)
81     3333        (cannot appear)

Per-processor probabilities p_i,j replace the p_i’s; improper tuples are removed, and the probabilities of the remaining tuples add up to 1
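A simplified sketch of step 4, under an assumed rule: let size 3 be a "squashed" size, which can never appear on PE0 (the non-speculative processor). Improper tuples are dropped and the remaining mass rescaled so the probabilities again sum to 1. (The paper's model instead computes per-processor probabilities p_i,j; the plain renormalization here is a stand-in for illustration.)

```python
from itertools import product

probs = {1: 0.5, 2: 0.3, 3: 0.2}       # hypothetical p1, p2, p3
P = 4

# Drop every tuple that places the squashed size (3) on PE0:
tuples = [t for t in product(probs, repeat=P) if t[0] != 3]

def prob(t):
    q = 1.0
    for i in t:
        q *= probs[i]
    return q

total = sum(prob(t) for t in tuples)   # probability mass of proper tuples
weights = {t: prob(t) / total for t in tuples}

assert all(t[0] != 3 for t in weights)            # no improper tuple left
assert abs(sum(weights.values()) - 1.0) < 1e-12   # adds up to 1 again
```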
5. Compute Sequential and Parallel Execution Times

Within a tuple:
Tseq_tuple = Σ (over threads i in the tuple) Wi
Tpar_tuple = max (over threads i in the tuple) Wi

Tuple  Assignment  Probability       Tseq_tuple  Tpar_tuple
1      1111        p1.p1.p1.p1       4.W1        W1
2      1112        p1.p1.p1.p2       3.W1 + W2   W2
3      1113        p1.p1.p1.p3       3.W1 + W3   W3
…      …           …                 …           …
80     3332        p3.p3.p3.p2       3.W3 + W2   W3
81     3333        p3.p3.p3.p3       4.W3        W3

(assuming W1 < W2 < W3)
5. Compute Sequential and Parallel Execution Times (contd.)

Tseq_est (estimated sequential execution time): the probability-weighted sum of Tseq_tuple over all tuples
Tpar_est (estimated parallel execution time): the probability-weighted sum of Tpar_tuple over all tuples
5. Compute Sequential and Parallel Execution Times (contd.)

Estimated sequential execution time:
  Tseq_est = P . Σ (i=1..NB) Wi.pi
  O(NB) << enumeration (NB: number of base thread sizes)

Estimated parallel execution time:
  Tpar_est = Σ (i=1..N) p(Tpar_tuple = Wi) . Wi
where:
  p(Tpar_tuple = Wi) = Π (k=0..P-1) ( Σ (l=1..i) p_l,k ) − Σ (m=1..i-1) p(Tpar_tuple = Wm)
  O(N.P + O(p_i,j)), where O(p_i,j) is the complexity of computing the p_i,j’s
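Steps 5 and 6 can be sketched and cross-checked against brute-force enumeration. The probabilities and sizes below are made up, and the per-processor probabilities are simplified to p_i,j = p_i (uniform across processors); with sizes sorted increasingly, the probability that the tuple maximum equals W[i] is the probability that every processor holds a size at most W[i], minus the probabilities of all smaller maxima:

```python
from itertools import product

W = [10.0, 14.0, 20.0]                 # W1 < W2 < W3 (hypothetical)
p = [0.5, 0.3, 0.2]                    # p1, p2, p3 (hypothetical)
P = 4                                  # number of processors

def p_max(i):
    """p(Tpar_tuple == W[i]) via the closed form, no tuple enumeration."""
    cdf = sum(p[:i + 1]) ** P          # all P processors hold a size <= W[i]
    return cdf - sum(p_max(m) for m in range(i))

Tpar_est = sum(p_max(i) * W[i] for i in range(len(W)))
Tseq_est = P * sum(w * q for w, q in zip(W, p))   # P . sum(Wi.pi)
S_est = Tseq_est / Tpar_est                       # step 6: estimated speedup

# Brute-force check over all N**P tuples:
brute = 0.0
for t in product(range(len(W)), repeat=P):
    q = 1.0
    for i in t:
        q *= p[i]
    brute += q * max(W[i] for i in t)
assert abs(Tpar_est - brute) < 1e-9
assert S_est > 1.0                     # speculation predicted to pay off here
```

The closed form touches each size once per processor instead of visiting N^P tuples, which is the source of the O(N.P) term in the complexity above.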
5. Compute Sequential and Parallel Execution Times (contd.)

p_i,j is the probability that thread i appears on processor j; it either equals p_i or is computed for every pair of threads involved in an overhead
Each p_i,j is computed in:
–O(1) for the squash overhead
–O(NB) for the overflow overhead
Thus all p_i,j’s are computed in O(NB.N.P) = O(N².P)
Thus all p(Tpar_tuple = Wi) are computed in O(N².P)
Finally, Tpar_est is computed in O(N².P) << enumeration
6. Computing the Estimated Speedup

S_est = Tseq_est / Tpar_est
O(N².P) << enumeration (compare with the O(N) PACT’04 model)
Outline

Motivation
Speculative Parallelization
Compiler Cost Model
–Evaluation
–Related Work
–Conclusions
Current and Future Directions
Evaluation Environment

Implementation: IR of SUIF1
–high-level control structure retained
–instructions within basic blocks dismantled
Simulation: trace-driven with Simics
Architecture: Stanford Hydra CMP
–4 single-issue processors
–private 16KB L1 caches
–private fully associative 2KB speculative buffers
–shared on-chip 2MB L2 cache
Applications

Subset of the SPEC2000 benchmarks:
–4 floating point and 5 integer
MinneSPEC reduced input sets:
–input size: 2 to 3 billion instructions
–simulated instructions: 100 to 600 million
Focus on load imbalance and squash overheads:
–none of the loops suffered from overflow
Total of 190 loops:
–collectively account for about 50% to 100% of the sequential execution time of most applications
Speedup Distribution

Very varied speedup/slowdown behavior across loops
Model Accuracy (I): Outcomes

Only 23% false positives (performance degradation)
Negligible false negatives (missed opportunities)
Most speedups/slowdowns are correctly predicted by the model
Model Accuracy (II): Cumulative Errors Distribution

Error less than 50% for 77% of the loops
Acceptable errors, but room for improvement
Model Accuracy (II): Cumulative Errors Distribution
Performance Improvements

Mostly better performance than previous policies
Close to the performance of an oracle
Can curb the performance degradation of the naive policy
Outline

Motivation
Speculative Parallelization
Compiler Cost Model
–Evaluation
–Related Work
–Conclusions
Current and Future Directions
Related Work

Architectures supporting speculative parallelization:
–Multiscalar processor (Wisconsin)
–Hydra (Stanford)
–Clustered Speculative Multithreaded processor (UPC)
–Thread-Level Data Speculation (CMU)
–MAJC (Sun)
–Superthreaded processor (Minnesota)
–Multiplex (Purdue)
–CMP with speculative multithreading (Illinois)
Related Work

Compiler support for speculative parallelization:
–most of the above projects have a compiler branch
–thread partitioning and optimizations based on simple heuristics and/or profiling
Recent publications on compiler cost models:
–Chen et al. (PPoPP’03): a mathematical model, concentrated on probabilistic points-to analysis
–Du et al. (PLDI’04): a cost model of squash overhead based on the probability of dependences; only intended for a CMP with 2 processors
No literature found on a cost model that includes load imbalance and the other overheads, and that targets several processors
Conclusions

Compiler cost model of speculative multithreaded execution
Fairly accurate quantitative predictions of speedup:
–correctly identifies speedup/slowdown in 73% of cases
–errors of less than 50% in 77% of the cases
Good model-driven selection policy:
–usually faster than other policies and within 11% of an oracle
–can curb the performance degradation of the naive policy
Can accommodate all other speculative execution overheads:
–accuracy not as high as the PACT’04 results, but still good for a static scheme
Outline

Motivation
Speculative Parallelization
Compiler Cost Model
–Evaluation
–Related Work
–Conclusions
Current and Future Directions
Current and Future Directions

Software-only speculative parallelization:
–speculatively parallelized the Convex Hull problem (CG) (CGA’04)
–parallelizing the Minimum Enclosing Circle (CG) and Simultaneous Multiple Sequence Alignment (Bioinformatics) problems
–extending the scheme to reduce/eliminate overheads
Complete compiler model of overheads:
–use probabilistic memory disambiguation analysis to factor the squash overhead into the model
–use probabilistic cache miss models to factor the speculative buffer overflow overhead into the model
Current and Future Directions

Probabilistic memory disambiguation:
–extend current points-to, alias, and data flow analyses to generate the probability of occurrence of these relations
–necessary infrastructure for all quantitative cost models for data speculation
Other speculative multithreading models:
–combining speculative parallelization with speculative helper threads
Current and Future Directions

Speculative compiler optimizations:
–perform traditional compiler optimizations in the presence of potential data flow relations (e.g., loop distribution/fusion, hoisting, hyperblock instruction scheduling)
–use spare contexts in SMT (Hyperthreading) processors to run/verify speculative optimizations (“helper threads”)
–add TLS support for deep speculation
Acknowledgments

Research team and collaborators:
–Prof. Diego Llanos (University of Valladolid, Spain)
–Prof. Belen Palop (University of Valladolid, Spain)
–Jialin Dou
–Constantino Ribeiro
–Salman Khan
–Syamsul Bin Hussin
Funding:
–UK – EPSRC
–EC – TRACS
–EC – HPC Europa
Squashing

(figure: timeline of producer thread i and consumer threads i+j+1 and i+j+2; the producer’s Wr arriving after the consumer’s Rd squashes the consumers, splitting their execution into useful work, possibly correct work, wasted correct work, and squash overhead)

Squashing is very costly
5. Compute Sequential and Parallel Execution Times (contd.)

Each p_i,j is computed as follows.

Squash overheads:
p_i,j = p_i, for the original size and j = 0
p_i,j = (1 − p_producer)^j . p_base + (1 − (1 − p_producer)^j) . p_base . (1 − p_dep), for the original size and j > 0
p_i,j = (1 − (1 − p_producer)^j) . p_base . p_dep, for the squashed size

Overflow overheads:
p_i,j = p_i . (1 − p_ovflow) + p_i . p_ovflow . (1 − Σ (k longer) p_k)^j, for the original size and j ≠ 0
p_i,j = (( Σ (k=1..wait) p_k )^j − ( Σ (k=1..wait-1) p_k )^j) . p_ovflow . p_base, for the overflowed size
Model Accuracy (III): Squash Prediction
Sources of Largest Errors (top 10%)

Source of error                        Number  Error (%)
Incorrect IR workload estimation       4       54~116
Unknown iteration count (i<P)          3       54~61
Unknown inner-loop iteration count     2       98~161
Biased conditional                     1       136