Handling Branches in TLS Systems with Multi-Path Execution
Polychronis Xekalakis and Marcelo Cintra
University of Edinburgh

HPCA 2010

Introduction
• Power efficiency, complexity and time-to-market reasons lead to CMPs
• Many simple cores = high TLP but low ILP
  – OK for throughput computing and embarrassingly parallel applications
• Problem:
  – No benefits for sequential applications
  – Even for mostly parallel applications, Amdahl's Law limits performance gains with many cores
• Solution: Speculative Multithreading (SM)
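The Amdahl's Law point above can be made concrete with a small calculation. This is a minimal sketch: the function name `amdahl_speedup` and the 90% parallel fraction are illustrative assumptions, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's Law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is not sped up; the parallel part is divided among cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical application that is 90% parallel.
    for cores in (4, 16, 64):
        print(cores, "cores ->", round(amdahl_speedup(0.90, cores), 2))
```

With a 10% sequential section, the speedup stays below 10 no matter how many cores are added, which is the gap speculative multithreading targets.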

Speculative Multithreading
• Basic idea: use idle cores/contexts to speculate on future application needs
  – TLS: speculatively execute parallel threads
  – HT/RA: speculatively perform future memory operations
  – MP: speculatively execute along multiple branch targets
• No SM model works best at all times
• Hardware infrastructure is very similar
• Our idea: combine SM models and seamlessly exploit (speculative) TLP and/or ILP
  – In this work: TLS + MP
  – (for TLS + HT/RA see [ICS'09])

Key Contributions
• Analyze branch prediction for TLS systems
• Propose a mixed execution model that combines TLS with MP execution
• Show that TLS allows MP to be more aggressive
• Our approach outperforms state-of-the-art SM models:
  – TLS by 9.2% avg. (up to 23.2%)
  – MP by 28.2% avg. (up to 138%)

Outline
• Introduction
• Speculative Multithreaded Models
• Analysis of Branch Prediction in TLS
• Mixed Execution Model
• Experimental Setup and Results
• Conclusions

Thread Level Speculation
• Compiler deals with:
  – Task selection
  – Code generation
• HW deals with:
  – Different context
  – Spawn threads
  – Detecting violations
  – Replaying
  – Arbitrate commit
[Figure: timeline showing Thread 1 and speculative Thread 2]

Thread Level Speculation
• Benefit: TLP/ILP
  – TLP (overlapped execution)
  – ILP (prefetching)
[Figure: two timelines of Thread 1 and speculative Thread 2, labeled "Overlapped Execution" and "Prefetching"]

MultiPath Execution
• Compiler deals with:
  – Nothing
• HW deals with:
  – Different context
  – When to do MP
  – Discard wrong path
[Figure: timeline of the main thread entering MP mode, showing correct paths and wrong paths]

MultiPath Execution
• Benefit:
  – ILP (branch prediction)
[Figure: timeline of the main thread showing correct paths, wrong paths, and the branch misprediction cost]

Impact of Branch Prediction on TLS
• TLS emulates a wider processor:
  – Removing mispredictions is important (Amdahl)

Branch Entropy for TLS
• Branch prediction is much harder for TLS:
  – History partitioning
  – History re-order
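A toy model can illustrate the history partitioning effect. This sketch rests on several assumptions that are not in the slides: tasks are fixed-length runs of branches, tasks are assigned round-robin to cores, and each core keeps a private global-history register. It does not model history re-ordering, and all function names are hypothetical.

```python
from typing import List, Tuple


def sequential_histories(outcomes: List[int], bits: int) -> List[Tuple[int, ...]]:
    """History seen before each branch when a single core runs everything in order.

    Assumes bits >= 1.
    """
    history = [0] * bits
    seen = []
    for taken in outcomes:
        seen.append(tuple(history))
        history = history[1:] + [taken]  # shift in the newest outcome
    return seen


def tls_histories(outcomes: List[int], bits: int, task_len: int,
                  cores: int) -> List[Tuple[int, ...]]:
    """History seen before each branch when tasks are spread over several cores.

    Each core only records the outcomes of the tasks it executes, so the
    history register misses everything that ran on the other cores.
    """
    registers = [[0] * bits for _ in range(cores)]
    seen = []
    for i, taken in enumerate(outcomes):
        core = (i // task_len) % cores  # assumed round-robin task mapping
        seen.append(tuple(registers[core]))
        registers[core] = registers[core][1:] + [taken]
    return seen


if __name__ == "__main__":
    outcomes = [1, 1, 0] * 8  # hypothetical repeating branch pattern
    seq = sequential_histories(outcomes, bits=4)
    tls = tls_histories(outcomes, bits=4, task_len=3, cores=4)
    differing = sum(1 for s, t in zip(seq, tls) if s != t)
    print(f"{differing} of {len(outcomes)} branches see a different history")
```

In this model a single core reproduces the sequential history exactly, while splitting the same branch stream over four cores leaves each predictor with only a fragment of it.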

Increasing the Size of the Branch Predictor
• Aliasing is not much of a problem
• The fundamental limitation is lack of history

Designing a Better Predictor
• Predictors that exploit longer histories are not necessarily better

Mixed Execution Model
• When there are idle resources:
  – Try MP on top of TLS
• Map TLS threads on empty cores
• Map MP threads on empty contexts (same core)
• Minimal extra HW:
  – Branch confidence estimator
  – MP bit: thread is in MP mode
  – PATHS: how many outstanding branches
  – DIR: which path the thread followed

Combined TLS/MP Model
[Figure: timeline of Thread 1 and speculative Thread 2, shown in four steps]
• Start: Thread 1 and speculative Thread 2 execute under TLS
• Low-confidence branch in Thread 1:
  – Thread 1: MP: 0, PATHS: 000, DIR: 000
• Multi-path mode (Thread 1 splits into Thread 1a and Thread 1b):
  – Thread 1a: MP: 1, PATHS: 001, DIR: 000
  – Thread 1b: MP: 1, PATHS: 001, DIR: 001
• Branch resolved:
  – Thread 1a: MP: 1, PATHS: 001, DIR: 000
  – Thread 1b: MP: 0, PATHS: 000, DIR: 000
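The state transitions in this walkthrough can be sketched in a few lines. This is only an illustration under stated assumptions: PATHS and DIR are treated as bit masks with one bit per outstanding branch, a DIR bit of 1 is taken to mean the "taken" path, and the surviving copy is the one whose DIR bit matches the resolved outcome. The names `ThreadState`, `fork_on_low_confidence`, `resolve` and `fmt` are hypothetical, not from the paper.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ThreadState:
    name: str
    mp: int = 0         # MP bit: 1 while the thread runs in multi-path mode
    paths: int = 0      # PATHS: mask of outstanding low-confidence branches
    direction: int = 0  # DIR: path followed at each outstanding branch


def fork_on_low_confidence(t: ThreadState, slot: int = 0) -> Tuple[ThreadState, ThreadState]:
    """Split a thread in two at a low-confidence branch, one copy per direction."""
    bit = 1 << slot
    a = replace(t, name=t.name + "a", mp=1, paths=t.paths | bit,
                direction=t.direction & ~bit)
    b = replace(t, name=t.name + "b", mp=1, paths=t.paths | bit,
                direction=t.direction | bit)
    return a, b


def resolve(a: ThreadState, b: ThreadState, taken: bool, slot: int = 0) -> ThreadState:
    """Return the surviving copy once the branch outcome is known (assumed semantics)."""
    bit = 1 << slot
    survivor = b if taken else a
    paths = survivor.paths & ~bit
    # The thread leaves MP mode only when no branch is still outstanding.
    return replace(survivor, paths=paths, direction=survivor.direction & ~bit,
                   mp=1 if paths else 0)


def fmt(t: ThreadState) -> str:
    return f"{t.name} MP: {t.mp} PATHS: {t.paths:03b} DIR: {t.direction:03b}"


if __name__ == "__main__":
    t1a, t1b = fork_on_low_confidence(ThreadState("Thread 1"))
    print(fmt(t1a))
    print(fmt(t1b))
    print(fmt(resolve(t1a, t1b, taken=True)))
```

Under these assumptions, forking `Thread 1` yields the two register states shown for the multi-path step, and resolving the branch as taken leaves Thread 1b with all three fields cleared, as in the last step.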

Intricacies to be Handled
• How do we map TLS/MP threads?
  – Different mapping policies for TLS threads
• Dealing with thread ordering
  – Correct data forwarding
• Dealing with violations
  – While in "MP mode", delay restarts/kills/commits
  – No squashes on the wrong path
• Thread spawning:
  – Delayed as well, to keep contention low

Experimental Setup
• Simulator, compiler and benchmarks:
  – SESC (
  – POSH (Liu et al., PPoPP '06)
  – SPEC 2000 Int.
• Architecture:
  – Four-way CMP, 4-issue cores, 6 contexts per core
  – 32 Kbit OGEHL, 1 KByte BTB, 32-entry RAS
  – 8 Kbit enhanced JRS confidence estimator
  – 32 KB L1 data (multi-versioned) and instruction caches
  – 1 MB unified L2 caches
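For readers unfamiliar with JRS-style confidence estimation, the sketch below shows the basic idea: a table of resetting counters that count consecutive correct predictions. It models only the basic scheme, not the "enhanced" variant used in the setup. The XOR indexing, the 2048 x 4-bit organization (one way to spend 8 Kbit), the threshold, and the helper `should_fork` are all assumptions made for illustration.

```python
class JRSConfidenceEstimator:
    """Table of resetting counters, in the spirit of the basic JRS scheme."""

    def __init__(self, entries: int = 2048, counter_bits: int = 4,
                 threshold: int = 15) -> None:
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1
        self.threshold = threshold
        self.table = [0] * entries

    def _index(self, pc: int, history: int) -> int:
        # Assumed hash: branch address XOR global history, folded into the table.
        return (pc ^ history) % self.entries

    def is_high_confidence(self, pc: int, history: int) -> bool:
        return self.table[self._index(pc, history)] >= self.threshold

    def update(self, pc: int, history: int, prediction_correct: bool) -> None:
        i = self._index(pc, history)
        if prediction_correct:
            # Saturating increment on a correct prediction.
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            # A misprediction resets the counter to zero.
            self.table[i] = 0


def should_fork(ce: JRSConfidenceEstimator, pc: int, history: int,
                free_contexts: int) -> bool:
    """Hypothetical policy: go multi-path on a low-confidence branch if a context is idle."""
    return free_contexts > 0 and not ce.is_high_confidence(pc, history)


if __name__ == "__main__":
    ce = JRSConfidenceEstimator(entries=16, counter_bits=2, threshold=3)
    for _ in range(3):
        ce.update(0x40, 0b1010, prediction_correct=True)
    print("fork after 3 correct predictions:", should_fork(ce, 0x40, 0b1010, 2))
    ce.update(0x40, 0b1010, prediction_correct=False)
    print("fork after a misprediction:", should_fork(ce, 0x40, 0b1010, 2))
```

The threshold controls how aggressively multi-path mode is entered: a higher threshold marks more branches as low confidence and therefore forks more often.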

Comparing TLS, MP and Combined TLS/MP
• Additive benefits; no point in doubling the predictor
• 9.2% over TLS, 28.2% over MP

Pipeline Flushes
• Significant reduction in pipeline flushes
• More than base MP!

Also in the Paper …
• Detailed HW description
• Impact of scheduling
• Limiting MP to DP
• Effect of scaling
• Effect of a better CE

Conclusions
• CMPs are here to stay:
  – What about single-threaded apps and apps with significant sequential sections?
  – We advocate the use of speculative multithreading
• Analyzed branch prediction for modern TLS systems
• Proposed a new mixed execution model
  – TLS is nicely complemented by MP
• Unified scheme outperforms existing SM models
  – TLS by 9.2% avg. (up to 23.2%)
  – MP by 28.2% avg. (up to 138%)

Backup Slides

Prediction Stats
[Table: per-benchmark prediction statistics (%). Columns: Bzip2, Crafty, Gap, Gzip, Mcf, Parser, Twolf, Vortex, Vpr, Avg. Rows: Misp, PVN, PVP, SPEC, SENS.]
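The row labels appear to be the usual confidence-estimator quality metrics. The sketch below computes them from a 2x2 breakdown of predictions, assuming the commonly used definitions (sensitivity, specificity, and predictive value of a positive or negative test, where "positive" means high confidence). Whether the slides use exactly these definitions is an assumption, and the counts in the example are made up for illustration, not taken from the paper.

```python
from typing import Dict


def confidence_metrics(correct_high: int, incorrect_high: int,
                       correct_low: int, incorrect_low: int) -> Dict[str, float]:
    """Confidence-estimator metrics, in percent, from prediction counts.

    correct_high / incorrect_high: predictions marked high confidence.
    correct_low / incorrect_low: predictions marked low confidence.
    """
    total = correct_high + incorrect_high + correct_low + incorrect_low
    return {
        # Fraction of all predictions that were wrong.
        "Misp": 100.0 * (incorrect_high + incorrect_low) / total,
        # Of the correct predictions, how many were marked high confidence.
        "SENS": 100.0 * correct_high / (correct_high + correct_low),
        # Of the mispredictions, how many were marked low confidence.
        "SPEC": 100.0 * incorrect_low / (incorrect_high + incorrect_low),
        # Of the high-confidence predictions, how many were correct.
        "PVP": 100.0 * correct_high / (correct_high + incorrect_high),
        # Of the low-confidence predictions, how many were wrong.
        "PVN": 100.0 * incorrect_low / (correct_low + incorrect_low),
    }


if __name__ == "__main__":
    # Hypothetical counts for 1000 dynamic branches.
    stats = confidence_metrics(correct_high=900, incorrect_high=20,
                               correct_low=50, incorrect_low=30)
    for name, value in stats.items():
        print(f"{name}: {value:.1f}%")
```

For multi-path execution, PVN is the figure to watch: it is the share of forked (low-confidence) branches that really were mispredicted, so a low PVN means contexts are spent on paths that were not needed.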

Performance Model
• $S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$
1. Compute overall speedup: $S_{all} = T_{seq} / T_{mt}$
2. Compute sequential TLS speedup: $S_{seq} = T_{seq} / T_{1p}$
3. Compute speedup due to ILP: $S_{ilp} = (T_1 + T_2) / (T_1' + T_2')$
4. Use everything to compute TLP: $S_{ovl} = S_{all} / (S_{seq} \times S_{ilp})$
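The four steps of the performance model can be followed in a short calculation. This sketch mirrors the formulas on the slides. How the unprimed and primed thread times are measured is not stated in the transcript, so reading them as per-thread times in the single-processor run and in the multithreaded run is an assumption, as are the function name and all example numbers.

```python
from typing import Dict, Sequence


def decompose_speedup(t_seq: float, t_mt: float, t_1p: float,
                      t_threads: Sequence[float],
                      t_threads_prime: Sequence[float]) -> Dict[str, float]:
    """Split the overall speedup into S_seq, S_ilp and S_ovl.

    t_seq: time of the original sequential run (T_seq)
    t_mt: time of the multithreaded run (T_mt)
    t_1p: time of the run denoted T_1p on the slides
    t_threads: per-thread times T_1, T_2, ... (assumed interpretation)
    t_threads_prime: per-thread times T_1', T_2', ... (assumed interpretation)
    """
    s_all = t_seq / t_mt                            # step 1
    s_seq = t_seq / t_1p                            # step 2
    s_ilp = sum(t_threads) / sum(t_threads_prime)   # step 3
    s_ovl = s_all / (s_seq * s_ilp)                 # step 4: whatever remains
    return {"S_all": s_all, "S_seq": s_seq, "S_ilp": s_ilp, "S_ovl": s_ovl}


if __name__ == "__main__":
    # Hypothetical timings in arbitrary units.
    parts = decompose_speedup(t_seq=100.0, t_mt=50.0, t_1p=110.0,
                              t_threads=(60.0, 50.0),
                              t_threads_prime=(50.0, 45.0))
    for name, value in parts.items():
        print(f"{name} = {value:.3f}")
```

Because the overlap term is computed last as the remainder, the three factors always multiply back to the overall speedup by construction.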