Handling Branches in TLS Systems with Multi-Path Execution Polychronis Xekalakis and Marcelo Cintra University of Edinburgh

Handling Branches in TLS Systems with Multi-Path Execution Polychronis Xekalakis and Marcelo Cintra University of Edinburgh http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA

HPCA 2010
Introduction
• Power efficiency, complexity and time-to-market reasons lead to CMPs
• Many simple cores = high TLP but low ILP
  – OK for throughput computing and embarrassingly parallel applications
• Problem:
  – No benefits for sequential applications
  – Even for mostly parallel applications, Amdahl's Law limits performance gains with many cores
• Solution: Speculative Multithreading (SM)
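To make the Amdahl's Law point concrete, here is a minimal sketch of the standard formula. The function name and the 90% parallel fraction are illustrative assumptions and do not come from the slides.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time regardless of the core count.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical application that is 90% parallel.
    for cores in (2, 4, 16, 64):
        print(cores, round(amdahl_speedup(0.9, cores), 2))
```

Under these assumptions the speedup stays below 10x no matter how many cores are added, which is why the sequential sections matter even for mostly parallel applications.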

Speculative Multithreading
• Basic idea: use idle cores/contexts to speculate on future application needs
  – TLS: speculatively execute parallel threads
  – HT/RA: speculatively perform future memory operations
  – MP: speculatively execute along multiple branch targets
• No SM model works best at all times
• Hardware infrastructure is very similar
• Our idea: combine SM models and seamlessly exploit (speculative) TLP and/or ILP
  – In this work: TLS + MP
  – (for TLS + HT/RA see [ICS'09])

Key Contributions
• Analyze branch prediction for TLS systems
• Propose a mixed execution model that combines TLS with MP execution
• Show that TLS allows MP to be more aggressive
• Our approach outperforms state-of-the-art SM models:
  – TLS by 9.2% avg. (up to 23.2%)
  – MP by 28.2% avg. (up to 138%)

Outline
• Introduction
• Speculative Multithreaded Models
• Analysis of Branch Prediction in TLS
• Mixed Execution Model
• Experimental Setup and Results
• Conclusions

Thread Level Speculation
• Compiler deals with:
  – Task selection
  – Code generation
• HW deals with:
  – Different context
  – Spawning threads
  – Detecting violations
  – Replaying
  – Arbitrating commit
[Figure: timeline of Thread 1 and speculative Thread 2]

Thread Level Speculation
• Benefit: TLP/ILP
  – TLP (Overlapped Execution)
  – ILP (Prefetching)
[Figure: two timelines of Thread 1 and speculative Thread 2; panels: Overlapped Execution, Prefetching]

MultiPath Execution
• Compiler deals with:
  – Nothing
• HW deals with:
  – Different context
  – When to do MP
  – Discarding the wrong path
[Figure: timeline of the main thread entering MP mode, with correct and wrong paths]

MultiPath Execution
• Benefit:
  – ILP (Branch Pred.)
[Figure: timeline of the main thread with correct and wrong paths, showing the branch misprediction cost]

Impact of Branch Prediction on TLS
• TLS emulates a wider processor:
  – Removing mispredictions is important (Amdahl)

Branch Entropy for TLS
• Much harder for TLS:
  – History partitioning
  – History re-order
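As a toy illustration of history partitioning, the sketch below compares the global history each branch would see in sequential execution with what it would see when the same branch stream is split across two threads. It assumes that each speculative thread starts with an empty history register, which is an assumption made here for simplicity and is not stated in the slides; all function names are hypothetical.

```python
def histories(outcomes, bits, start=0):
    """Return the global-history value seen by each branch, keeping `bits` bits."""
    mask = (1 << bits) - 1
    hist, seen = start, []
    for taken in outcomes:
        seen.append(hist)
        # Shift in the newest outcome and drop the oldest one.
        hist = ((hist << 1) | int(bool(taken))) & mask
    return seen


def partitioned_histories(outcomes, split, bits):
    """Split the stream into two threads; each starts with an empty history (assumption)."""
    return histories(outcomes[:split], bits) + histories(outcomes[split:], bits)


def differing_branches(outcomes, split, bits):
    """Count branches whose history differs between sequential and split execution."""
    sequential = histories(outcomes, bits)
    partitioned = partitioned_histories(outcomes, split, bits)
    return sum(a != b for a, b in zip(sequential, partitioned))


if __name__ == "__main__":
    stream = [1, 1, 1, 1, 1, 1, 1, 1]  # eight taken branches
    print(differing_branches(stream, split=4, bits=4))
```

In this toy setup every branch in the second thread indexes the predictor with a different history than it would sequentially, so entries trained on the sequential pattern are of little use to it.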

Increasing the Size of the Branch Predictor
• Aliasing is not much of a problem
• Fundamental limitation is the lack of history

Designing a Better Predictor
• Predictors that exploit longer histories are not necessarily better

Mixed Execution Model
• When there are idle resources:
  – Try MP on top of TLS!
• Map TLS threads on empty cores
• Map MP threads on empty contexts (same core)
• Minimal extra HW:
  – Branch confidence estimator
  – MP bit: thread is in MP mode
  – PATHS: how many outstanding branches
  – DIR: which path the thread followed
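The branch confidence estimator is what decides when a branch is worth forking on. Below is a minimal sketch of a resetting-counter estimator in the general style of JRS designs. The class name, table size, counter width, threshold and XOR indexing are illustrative assumptions; this is not the enhanced JRS estimator evaluated in the paper.

```python
class ResettingConfidenceEstimator:
    """Table of saturating counters that reset on a misprediction (illustrative sketch)."""

    def __init__(self, entries=1024, counter_bits=4, threshold=15):
        if entries < 1 or counter_bits < 1:
            raise ValueError("entries and counter_bits must be positive")
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1
        self.threshold = threshold
        self.table = [0] * entries

    def _index(self, pc, history):
        # Assumed indexing: XOR of branch address and global history.
        return (pc ^ history) % self.entries

    def is_high_confidence(self, pc, history):
        """A branch is trusted only after enough consecutive correct predictions."""
        return self.table[self._index(pc, history)] >= self.threshold

    def update(self, pc, history, prediction_correct):
        """Count up on a correct prediction, reset to zero on a misprediction."""
        i = self._index(pc, history)
        if prediction_correct:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = 0
```

In a combined TLS/MP scheme, a branch for which `is_high_confidence` returns `False` would be a candidate for entering MP mode, provided a free context is available on the same core.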

Combined TLS/MP Model
[Figure: timeline of Thread 1 and speculative Thread 2, built up in four steps]
1. Initial state: Thread 1 and speculative Thread 2
2. Low-confidence branch:
   Thread 1:  MP: 0  PATHS: 000  DIR: 000
3. Multi-path mode (Thread 1a and Thread 1b):
   Thread 1a: MP: 1  PATHS: 001  DIR: 000
   Thread 1b: MP: 1  PATHS: 001  DIR: 001
4. Branch resolved:
   Thread 1a: MP: 1  PATHS: 001  DIR: 000
   Thread 1b: MP: 0  PATHS: 000  DIR: 000
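The MP, PATHS and DIR values shown on these slides can be traced with a small sketch. It assumes that PATHS and DIR hold one bit per outstanding low-confidence branch and that a DIR bit of 1 marks the taken path; the slides only show the field values, so this encoding and all names below are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ThreadState:
    name: str
    mp: int = 0         # MP bit: thread is in multi-path mode
    paths: int = 0      # outstanding low-confidence branches (one bit each, assumed)
    direction: int = 0  # DIR: path followed for each outstanding branch (assumed)


def fork_on_low_confidence(thread, slot=0):
    """Split a thread into two paths at a low-confidence branch."""
    bit = 1 << slot
    not_taken = ThreadState(thread.name + "a", 1, thread.paths | bit, thread.direction & ~bit)
    taken = ThreadState(thread.name + "b", 1, thread.paths | bit, thread.direction | bit)
    return not_taken, taken


def resolve(not_taken, taken, branch_taken, slot=0):
    """Keep the thread on the correct path and clear the resolved branch."""
    bit = 1 << slot
    winner = taken if branch_taken else not_taken
    paths = winner.paths & ~bit
    # The thread leaves MP mode once no branches remain outstanding.
    return replace(winner, mp=int(paths != 0), paths=paths, direction=winner.direction & ~bit)


if __name__ == "__main__":
    t1a, t1b = fork_on_low_confidence(ThreadState("1"))
    print(t1a, t1b)
    print(resolve(t1a, t1b, branch_taken=True))
```

With the branch resolved as taken, the sketch leaves Thread 1b with MP: 0, PATHS: 000, DIR: 000, matching the last slide, while Thread 1a would be discarded.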

Intricacies to be Handled
• How do we map TLS/MP threads?
  – Different mapping policies for TLS threads
• Dealing with thread ordering
  – Correct data forwarding
• Dealing with violations
  – While in "MP mode", delay restarts/kills/commits
  – No squashes on the wrong path
• Thread spawning:
  – Delayed as well, to keep contention low

Experimental Setup
• Simulator, compiler and benchmarks:
  – SESC (http://sesc.sourceforge.net/)
  – POSH (Liu et al., PPoPP '06)
  – SPEC 2000 Int.
• Architecture:
  – Four-way CMP, 4-issue cores, 6 contexts / core
  – 32 Kbit OGEHL, 1 KByte BTB, 32-entry RAS
  – 8 Kbit enhanced JRS confidence estimator
  – 32 KB L1 data (multi-versioned) and instruction caches
  – 1 MB unified L2 caches

Comparing TLS, MP and Combined TLS/MP
• Additive benefits; no point in doubling the predictor
• 9.2% over TLS, 28.2% over MP

Pipeline Flushes
• Significant reduction in pipeline flushes
• More than base MP!

Also in the Paper …
• Detailed HW description
• Impact of scheduling
• Limiting MP to DP
• Effect of scaling
• Effect of a better CE

Conclusions
• CMPs are here to stay:
  – What about single-threaded apps and apps with significant sequential sections?
  – We advocate the use of speculative multithreading
• Analyzed branch prediction for modern TLS systems
• Proposed a new mixed execution model
  – TLS is nicely complemented by MP
• Unified scheme outperforms existing SM models
  – TLS by 9.2% avg. (up to 23.2%)
  – MP by 28.2% avg. (up to 138%)


Backup Slides

Prediction Stats
Stat. (%)  Bzip2  Crafty  Gap   Gzip  Mcf   Parser  Twolf  Vortex  Vpr   Avg.
Misp.      5.7    5.2     3.3   5.1   3.9   3.4     10     0.3     6.6   4.8
PVN        22.8   16.9    19.5  24.1  27.9  20.8    23.2   11.6    24.4  21.3
PVP        98.2   97.6    98.8  98.6  99.2  98.9    96.4   99.8    98    98.4
SPEC       90.7   89.1    89.7  91.4  91.8  90      91.3   88.5    91    90.4
SENS       95     96      97.5  95.4  96.6  97.3    89.5   99.8    93.9  95.7

Performance Model
• S_all = S_seq × S_ilp × S_ovl
1. Compute overall speedup: S_all = T_seq / T_mt
2. Compute sequential TLS speedup: S_seq = T_seq / T_1p
3. Compute speedup due to ILP: S_ilp = (T_1 + T_2) / (T_1' + T_2')
4. Use everything to compute TLP: S_ovl = S_all / (S_seq × S_ilp)
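The four steps of the performance model can be sketched as a short function. The parameter names and the sample timings are hypothetical; in particular, reading T1p as the time of the TLS code on a single processor, and T1, T2 versus T1', T2' as per-thread times without and with the ILP benefit, is an interpretation that the slides do not spell out.

```python
def decompose_speedup(t_seq, t_mt, t_1p, thread_times, thread_times_primed):
    """Split the overall speedup into sequential, ILP and overlap (TLP) factors."""
    s_all = t_seq / t_mt                                  # step 1: Tseq / Tmt
    s_seq = t_seq / t_1p                                  # step 2: Tseq / T1p
    s_ilp = sum(thread_times) / sum(thread_times_primed)  # step 3: (T1+T2) / (T1'+T2')
    s_ovl = s_all / (s_seq * s_ilp)                       # step 4: what remains is overlap
    return {"all": s_all, "seq": s_seq, "ilp": s_ilp, "ovl": s_ovl}


if __name__ == "__main__":
    # Made-up timings in arbitrary units, for illustration only.
    factors = decompose_speedup(
        t_seq=100.0, t_mt=50.0, t_1p=125.0,
        thread_times=(60.0, 65.0), thread_times_primed=(50.0, 50.0),
    )
    for name, value in factors.items():
        print(name, round(value, 3))
```

By construction the three factors multiply back to the overall speedup, so the overlap term absorbs whatever the sequential and ILP terms do not explain.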
