Published by Taylor Moreland. Modified over 10 years ago.
1
Combining Thread Level Speculation, Helper Threads, and Runahead Execution
Polychronis Xekalakis, Nikolas Ioannou and Marcelo Cintra
University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA
2
ICS 2009
Introduction
Single-core, out-of-order processors don’t scale
– Simpler solution: multi-core architectures
No speedup for single-threaded applications
– Use Thread Level Speculation (TLS) to extract TLP
– Use Helper Threads (HT) or RunAhead (RA) to improve ILP
However, for different applications (or phases), some models work better than others
Our proposal:
– Combine these execution models
– Decide at runtime when to employ them
3
Contributions
Introduce mixed Speculative Multithreading (SM) execution models
Design one that combines TLS, HT and RA
Propose a performance model able to quantify ILP and TLP benefits
Unified approach outperforms state-of-the-art SM models:
– TLS by 10.2% avg. (up to 41.2%)
– RA by 18.3% avg. (up to 35.2%)
4
Outline
Introduction
Speculative Multithreading Models
Performance Model
Unified Scheme
Experimental Setup and Results
Conclusions
5
Helper Threads
Compiler deals with:
– Memory operations that miss / hard-to-predict branches
– Backward slices
HW deals with:
– Spawning threads
– Different context
– Discarding threads when finished
Benefit:
– ILP (prefetch/warmup)
6
RunAhead Execution
Compiler deals with:
– Nothing
HW deals with:
– Different context
– When to do RA
– Value prediction (VP) of memory
– Commit/discard
Benefit:
– ILP (prefetch/warmup)
7
Thread Level Speculation
Compiler deals with:
– Task selection
– Code generation
HW deals with:
– Different context
– Spawning threads
– Detecting violations
– Replaying
– Arbitrating commit
Benefit: TLP/ILP
– TLP (overlapped execution) + ILP (prefetching)
8
Outline
Introduction
Speculative Multithreading Models
Performance Model
Unified Scheme
Experimental Setup and Results
Conclusions
9
Understanding Performance Benefits
Complex TLS thread interactions obscure performance benefits
This is even more true for mixed execution models
We need a way to quantify ILP and TLP contributions to bottom-line performance
Proposed model:
– Able to break benefits down into ILP and TLP contributions
10
Performance Model
$S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$
1. Compute overall speedup ($S_{all}$): $T_{seq} / T_{mt}$
11
Performance Model
$S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$
1. Compute overall speedup ($S_{all}$)
2. Compute sequential TLS speedup ($S_{seq}$): $T_{seq} / T_{1p}$
12
Performance Model
$S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$
1. Compute overall speedup ($S_{all}$)
2. Compute sequential TLS speedup ($S_{seq}$)
3. Compute speedup due to ILP ($S_{ilp}$): $(T_1 + T_2) / (T_1' + T_2')$
13
Performance Model
$S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$
1. Compute overall speedup ($S_{all}$)
2. Compute sequential TLS speedup ($S_{seq}$)
3. Compute speedup due to ILP ($S_{ilp}$)
4. Use everything to compute TLP ($S_{ovl}$): $S_{all} / (S_{seq} \times S_{ilp})$
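The four steps above can be sketched in Python. This is only an illustration of the arithmetic on the slides, not the authors' tooling: the function name `decompose_speedup`, its parameter names, and the sample timings are all hypothetical. The slides do not define T1, T2, T1’ and T2’, so the sketch simply takes the unprimed and primed per-thread times as two input lists.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SpeedupBreakdown:
    s_all: float  # overall speedup
    s_seq: float  # sequential TLS speedup
    s_ilp: float  # speedup attributed to ILP
    s_ovl: float  # speedup attributed to TLP (overlapped execution)


def decompose_speedup(
    t_seq: float,
    t_1p: float,
    t_mt: float,
    thread_times: Sequence[float],
    thread_times_primed: Sequence[float],
) -> SpeedupBreakdown:
    """Apply the slide formulas step by step (hypothetical helper)."""
    s_all = t_seq / t_mt                                  # step 1: Tseq / Tmt
    s_seq = t_seq / t_1p                                  # step 2: Tseq / T1p
    s_ilp = sum(thread_times) / sum(thread_times_primed)  # step 3: (T1+T2) / (T1'+T2')
    s_ovl = s_all / (s_seq * s_ilp)                       # step 4: Sall / (Sseq x Silp)
    return SpeedupBreakdown(s_all, s_seq, s_ilp, s_ovl)


# Made-up timings, chosen only to exercise the formulas.
breakdown = decompose_speedup(
    t_seq=100.0,
    t_1p=125.0,
    t_mt=50.0,
    thread_times=[60.0, 65.0],
    thread_times_primed=[50.0, 50.0],
)
print(breakdown)
```

By construction the three factors multiply back to the overall speedup, so the value of the decomposition is in how it attributes that speedup, not in the product itself.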
14
Outline
Introduction
Speculative Multithreading Models
Performance Model
Unified Scheme
Experimental Setup and Results
Conclusions
15
Unified Execution Model
Can we improve TLS?
1. Some of the threads do not help
2. There is slack in the usage of cores
Improve TLP:
– Requires a better compiler
Improve ILP:
– Combine TLS with another SM model!
– Most of the HW is common
16
Combining TLS, HT and RA
Start with TLS
Provide support to clone TLS threads and convert them to HT
Conversion to HT means:
– Put them in RA mode
– Suppress squashes and do not cause additional squashes
– Discard them when they finish
No compiler slicing: a purely HW approach
17
Intricacies to be Handled
HT may not prefetch effectively!
Dealing with contention
– HT threads are much faster and saturate bandwidth (BW)
Dealing with thread ordering
– TLS imposes a total thread order
– Killing an HT squashes TLS threads
18
Creating and Terminating HT
Create an HT on an L2 miss that we can value-predict (VP)
– Use a memory-address-based confidence estimator
– VP only if confident
Create an HT only if we have a free processor
Only allow the most speculative thread to clone
– Seamless integration of HT with TLS
– BUT: if the parent is no longer the most speculative TLS thread, the HT has to be killed
Additionally kill the HT when:
– The parent or the HT thread finishes
– The HT causes an exception
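The creation and termination rules above can be written down as two predicates. This is a software sketch of what the slides describe as a hardware policy; every name here (`SpawnConditions`, `should_create_ht`, `should_kill_ht`) is hypothetical, and the sketch assumes that all creation conditions must hold at once while any single termination condition is enough to kill the HT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpawnConditions:
    """Hypothetical snapshot of the state checked when a thread hits a miss."""
    l2_miss: bool              # the triggering load missed in L2
    vp_confident: bool         # address-based confidence estimator trusts the predicted value
    free_processor: bool       # a core is available to host the HT
    is_most_speculative: bool  # only the most speculative TLS thread may clone


def should_create_ht(c: SpawnConditions) -> bool:
    # Assumption: every creation rule on the slide must hold simultaneously.
    return c.l2_miss and c.vp_confident and c.free_processor and c.is_most_speculative


def should_kill_ht(
    *,
    parent_is_most_speculative: bool,
    parent_finished: bool,
    ht_finished: bool,
    ht_exception: bool,
) -> bool:
    # Assumption: any one termination rule is sufficient.
    return (
        not parent_is_most_speculative
        or parent_finished
        or ht_finished
        or ht_exception
    )


spawn_ok = should_create_ht(
    SpawnConditions(l2_miss=True, vp_confident=True,
                    free_processor=True, is_most_speculative=True)
)
spawn_low_confidence = should_create_ht(
    SpawnConditions(l2_miss=True, vp_confident=False,
                    free_processor=True, is_most_speculative=True)
)
kill_after_new_spawn = should_kill_ht(
    parent_is_most_speculative=False, parent_finished=False,
    ht_finished=False, ht_exception=False,
)
keep_running = should_kill_ht(
    parent_is_most_speculative=True, parent_finished=False,
    ht_finished=False, ht_exception=False,
)
print(spawn_ok, spawn_low_confidence, kill_after_new_spawn, keep_running)
```

Keeping creation and termination as separate predicates mirrors the slide's split between when an HT is spawned and when it has to be discarded.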
19
Outline
Introduction
Speculative Multithreading Models
Performance Model
Unified Scheme
Experimental Setup and Results
Conclusions
20
Experimental Setup
Simulator, compiler and benchmarks:
– SESC (http://sesc.sourceforge.net/)
– POSH (Liu et al., PPoPP ’06)
– SPEC 2000 Int.
Architecture:
– Four-way CMP, 4-issue cores
– 16KB L1 data (multi-versioned) and instruction caches
– 1MB unified L2 cache
– Inst. window/ROB: 80/104 entries
– 16KB last value predictor
21
Comparing TLS, RunAhead and Unified Scheme
22
Comparing TLS, RunAhead and Unified Scheme
Almost additive benefits
23
Comparing TLS, RunAhead and Unified Scheme
Almost additive benefits
10.2% over TLS, 18.3% over RA
24
Understanding the Extra ILP
Improvements in ILP come from:
– Mainly memory
– Branch prediction (improvement of 0.5%)
Focus on memory:
– Miss rate on the committed path
– Clustering of misses (different cost)
25
Normalized Shared Cache Misses
All schemes are better than sequential
Unified is 41% better than sequential
26
Isolated vs. Clustered Misses
Both TLS + RA
Large window machines
Unified does even better
27
Outline
Introduction
Speculative Multithreading Models
Performance Model
Unified Scheme
Experimental Setup and Results
Conclusions
28
Also in the paper ...
Dealing with the load of the system
Converting TLS threads to HT
Multiple HT
Effect of a better VP
Detailed comparison of the performance model against existing models (Renau et al., ICS ’05)
29
Conclusions
CMPs are here to stay:
– What about single-threaded apps and apps with significant sequential sections?
Different apps require different SM techniques
– Even within an app, different phases do
We propose the first mixed execution model
– TLS is nicely complemented by HT and RA
Our unified scheme outperforms existing SM models
– TLS by 10.2% avg. (up to 41.2%)
– RA by 18.3% avg. (up to 35.2%)