Combining Thread Level Speculation, Helper Threads, and Runahead Execution Polychronis Xekalakis, Nikolas Ioannou and Marcelo Cintra University of Edinburgh
ICS 2009

Introduction
Single-core, out-of-order processors do not scale
–Simpler solution: multi-core architectures
No speedup for single-threaded applications
–Use Thread Level Speculation (TLS) to extract TLP
–Use Helper Threads (HT) or Runahead (RA) execution to improve ILP
However, for different applications (or phases) some models work better than others
Our Proposal:
–Combine these execution models
–Decide at runtime when to employ them
Contributions
Introduce mixed Speculative Multithreading (SM) execution models
Design one that combines TLS, HT and RA
Propose a performance model able to quantify ILP and TLP benefits
Unified approach outperforms state-of-the-art SM models:
–TLS by 10.2% avg. (up to 41.2%)
–RA by 18.3% avg. (up to 35.2%)
Outline
Introduction
Speculative Multithreading Models
Performance Model
Unified Scheme
Experimental Setup and Results
Conclusions
Helper Threads
Compiler deals with:
–Memory ops that miss / hard-to-predict branches
–Backward slices
HW deals with:
–Spawning threads
–Different context
–Discarding results when finished
Benefit:
–ILP (prefetching/warm-up)
Runahead Execution
Compiler deals with:
–Nothing
HW deals with:
–Different context
–When to enter RA mode
–Value-predicting (VP) missing memory values
–Commit/discard
Benefit:
–ILP (prefetching/warm-up)
Thread Level Speculation
Compiler deals with:
–Task selection
–Code generation
HW deals with:
–Different context
–Spawning threads
–Detecting violations
–Replaying
–Arbitrating commit
Benefit:
–TLP (overlapped execution) + ILP (prefetching)
Outline
Introduction
Speculative Multithreading Models
Performance Model
Unified Scheme
Experimental Setup and Results
Conclusions
Understanding Performance Benefits
Complex TLS thread interactions obscure performance benefits
Even more true for mixed execution models
We need a way to quantify the ILP and TLP contributions to bottom-line performance
Proposed model:
–Able to break benefits into ILP/TLP contributions
Performance Model
Sall = Sseq x Silp x Sovl
1. Compute overall speedup: Sall = Tseq / Tmt
2. Compute sequential TLS speedup: Sseq = Tseq / T1p
3. Compute speedup due to ILP: Silp = (T1 + T2) / (T1' + T2')
4. Use everything to compute TLP: Sovl = Sall / (Sseq x Silp)
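The four steps above can be sketched as a short computation. This is a minimal illustration, not the paper's implementation: the function name and argument names are hypothetical, and the timing inputs (Tseq, Tmt, T1p, and the per-thread busy times T1, T2, T1', T2') are assumed to come from simulation.

```python
def speedup_decomposition(t_seq, t_mt, t_1p, t_busy, t_busy_overlapped):
    """Break overall speedup into sequential-TLS, ILP, and TLP (overlap) factors.

    t_seq  : runtime of the original sequential binary (Tseq)
    t_mt   : runtime of the multithreaded execution (Tmt)
    t_1p   : runtime of the TLS binary on a single processor (T1p)
    t_busy : per-thread busy times when run in isolation (T1, T2, ...)
    t_busy_overlapped : per-thread busy times under overlapped execution (T1', T2', ...)
    """
    s_all = t_seq / t_mt                          # step 1: overall speedup
    s_seq = t_seq / t_1p                          # step 2: sequential TLS speedup
    s_ilp = sum(t_busy) / sum(t_busy_overlapped)  # step 3: ILP contribution
    s_ovl = s_all / (s_seq * s_ilp)               # step 4: TLP (overlap) contribution
    return s_all, s_seq, s_ilp, s_ovl
```

By construction the three factors multiply back to the overall speedup, which is what makes the breakdown useful for attributing gains to ILP versus TLP.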
Outline
Introduction
Speculative Multithreading Models
Performance Model
Unified Scheme
Experimental Setup and Results
Conclusions
Unified Execution Model
Can we improve TLS?
1. Some of the threads do not help
2. Slack in the usage of cores
Improve TLP:
–Requires a better compiler
Improve ILP:
–Combine TLS with another SM model!
–Most of the HW is common
Combining TLS, HT and RA
Start with TLS
Provide support to clone TLS threads and convert them to HTs
Conversion to HT means:
–Put them in RA mode
–Suppress squashes of them and do not let them cause additional squashes
–Discard them when they finish
No compiler slicing: a purely HW approach
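The conversion step can be pictured as flipping a few per-thread flags. This is only an illustrative software sketch (the class and flag names are invented); in the scheme described here the conversion is done in hardware.

```python
class SpecThread:
    """Minimal stand-in for a speculative thread context (illustrative only)."""
    def __init__(self):
        self.runahead_mode = False     # normal TLS execution by default
        self.can_be_squashed = True    # participates in TLS squashes
        self.can_cause_squash = True   # its violations squash others
        self.discard_on_finish = False # normally tries to commit

def convert_tls_clone_to_helper(t):
    """Turn a cloned TLS thread into a helper thread."""
    t.runahead_mode = True       # run ahead, value-predicting instead of stalling
    t.can_be_squashed = False    # suppress squashes against it
    t.can_cause_squash = False   # and squashes it would otherwise cause
    t.discard_on_finish = True   # results are never committed
    return t
```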
Intricacies to be Handled
HT may not prefetch effectively!
Dealing with contention:
–HT threads saturate BW much faster
Dealing with thread ordering:
–TLS imposes a total thread order
–Killing a HT would squash TLS threads
Creating and Terminating HTs
Create a HT on an L2 miss we can value-predict (VP)
–Use a memory-address-based confidence estimator
–VP only if confident
Create a HT only if we have a free processor
Only allow the most speculative thread to clone
–Seamless integration of HT with TLS
–BUT: if the parent is no longer the most speculative TLS thread, the HT has to be killed
Additionally kill a HT when:
–The parent or the HT thread finishes
–The HT causes an exception
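The spawn and kill conditions above can be condensed into two predicates. This is a hypothetical software sketch for clarity (the `Thread` structure and function names are invented; the actual decisions are made by hardware):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Thread:
    is_most_speculative: bool = False   # most speculative TLS thread?
    is_helper: bool = False             # already a helper thread?
    parent: Optional["Thread"] = None   # TLS thread this helper was cloned from

def should_spawn_helper(thread, miss_is_l2, vp_confident, free_core_available):
    """Spawn a helper only on a value-predictable L2 miss, only from the most
    speculative TLS thread, and only if a core is idle."""
    return (miss_is_l2 and vp_confident
            and thread.is_most_speculative
            and not thread.is_helper
            and free_core_available)

def should_kill_helper(helper, parent_finished, helper_finished, raised_exception):
    """Kill the helper if its parent is no longer the most speculative TLS
    thread, if either thread finishes, or if the helper raises an exception."""
    return (not helper.parent.is_most_speculative
            or parent_finished or helper_finished or raised_exception)
```

Restricting cloning to the most speculative thread is what keeps the total TLS thread order intact: the helper sits past the end of the order, so discarding it never forces a squash of committed-path TLS threads.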
Outline
Introduction
Speculative Multithreading Models
Performance Model
Unified Scheme
Experimental Setup and Results
Conclusions
Experimental Setup
Simulator, Compiler and Benchmarks:
–SESC simulator
–POSH compiler (Liu et al., PPoPP '06)
–SPEC 2000 Int
Architecture:
–Four-way CMP, 4-issue cores
–16KB L1 data (multi-versioned) and instruction caches
–1MB unified L2 cache
–Inst. window/ROB: 80/104 entries
–16KB last value predictor
Comparing TLS, Runahead and the Unified Scheme
Almost additive benefits
10.2% over TLS, 18.3% over RA
Understanding the Extra ILP
Improvements in ILP come from:
–Mainly memory
–Branch prediction (0.5% improvement)
Focus on memory:
–Miss rate on the committed path
–Clustering of misses (different cost)
Normalized Shared Cache Misses
All schemes better than sequential
Unified 41% better than sequential
Isolated vs. Clustered Misses
Both TLS and RA cluster misses like large-window machines
Unified does even better
Outline
Introduction
Speculative Multithreading Models
Performance Model
Unified Scheme
Experimental Setup and Results
Conclusions
Also in the Paper…
Dealing with the load of the system
Converting TLS threads to HTs
Multiple HTs
Effect of a better VP
Detailed comparison of the performance model against existing models (Renau et al., ICS '05)
Conclusions
CMPs are here to stay:
–What about single-threaded apps and apps with significant sequential sections?
Different apps require different SM techniques
–Even different phases within an app do
We propose the first mixed execution model
–TLS is nicely complemented by HT and RA
Our unified scheme outperforms existing SM models:
–TLS by 10.2% avg. (up to 41.2%)
–RA by 18.3% avg. (up to 35.2%)