Combining Thread Level Speculation, Helper Threads, and Runahead Execution Polychronis Xekalakis, Nikolas Ioannou and Marcelo Cintra University of Edinburgh
ICS 2009

Introduction
Single-core, out-of-order processors do not scale
–Simpler solution: multi-core architectures
No speedup for single-threaded applications
–Use Thread Level Speculation (TLS) to extract TLP
–Use Helper Threads (HT) or Runahead (RA) execution to improve ILP
However, for different applications (or phases) some models work better than others
Our Proposal:
–Combine these execution models
–Decide at runtime when to employ them
Contributions
Introduce mixed Speculative Multithreading (SM) execution models
Design one that combines TLS, HT and RA
Propose a performance model able to quantify ILP and TLP benefits
Unified approach outperforms state-of-the-art SM models:
–TLS by 10.2% avg. (up to 41.2%)
–RA by 18.3% avg. (up to 35.2%)
Outline
Introduction
Speculative Multithreading Models
Performance Model
Unified Scheme
Experimental Setup and Results
Conclusions
Helper Threads
Compiler deals with:
–Memory ops that miss / hard-to-predict branches
–Backward slices
HW deals with:
–Spawning threads
–Different context
–Discarding results when finished
Benefit:
–ILP (prefetching/warm-up)
Runahead Execution
Compiler deals with:
–Nothing
HW deals with:
–Different context
–When to enter RA mode
–Value-predicting (VP) missing memory values
–Commit/discard
Benefit:
–ILP (prefetching/warm-up)
Thread Level Speculation
Compiler deals with:
–Task selection
–Code generation
HW deals with:
–Different context
–Spawning threads
–Detecting violations
–Replaying
–Arbitrating commit
Benefit:
–TLP (overlapped execution) + ILP (prefetching)
Outline
Introduction
Speculative Multithreading Models
Performance Model
Unified Scheme
Experimental Setup and Results
Conclusions
Understanding Performance Benefits
Complex TLS thread interactions obscure performance benefits
Even more true for mixed execution models
We need a way to quantify the ILP and TLP contributions to bottom-line performance
Proposed model:
–Able to break benefits into ILP/TLP contributions
Performance Model
Sall = Sseq x Silp x Sovl
1. Compute overall speedup: Sall = Tseq / Tmt
2. Compute sequential TLS speedup: Sseq = Tseq / T1p
3. Compute speedup due to ILP: Silp = (T1 + T2) / (T1' + T2')
4. Use everything to compute TLP: Sovl = Sall / (Sseq x Silp)
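The four steps above can be sketched as a short computation. This is a minimal illustration, not the paper's implementation: the function name and argument names are hypothetical, and the timing inputs (Tseq, Tmt, T1p, and the per-thread busy times T1, T2, T1', T2') are assumed to come from simulation.

```python
def speedup_decomposition(t_seq, t_mt, t_1p, t_busy, t_busy_overlapped):
    """Break overall speedup into sequential-TLS, ILP, and TLP (overlap) factors.

    t_seq  : runtime of the original sequential binary (Tseq)
    t_mt   : runtime of the multithreaded execution (Tmt)
    t_1p   : runtime of the TLS binary on a single processor (T1p)
    t_busy : per-thread busy times when run in isolation (T1, T2, ...)
    t_busy_overlapped : per-thread busy times under overlapped execution (T1', T2', ...)
    """
    s_all = t_seq / t_mt                          # step 1: overall speedup
    s_seq = t_seq / t_1p                          # step 2: sequential TLS speedup
    s_ilp = sum(t_busy) / sum(t_busy_overlapped)  # step 3: ILP contribution
    s_ovl = s_all / (s_seq * s_ilp)               # step 4: TLP (overlap) contribution
    return s_all, s_seq, s_ilp, s_ovl
```

By construction the three factors multiply back to the overall speedup, which is what makes the breakdown useful for attributing gains to ILP versus TLP.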
Outline
Introduction
Speculative Multithreading Models
Performance Model
Unified Scheme
Experimental Setup and Results
Conclusions
Unified Execution Model
Can we improve TLS?
1. Some of the threads do not help
2. Slack in the usage of cores
Improve TLP:
–Requires a better compiler
Improve ILP:
–Combine TLS with another SM model!
–Most of the HW is common
Combining TLS, HT and RA
Start with TLS
Provide support to clone TLS threads and convert them to HTs
Conversion to HT means:
–Put them in RA mode
–Suppress squashes of them and do not let them cause additional squashes
–Discard them when they finish
No compiler slicing: a purely HW approach
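The conversion step can be pictured as flipping a few per-thread flags. This is only an illustrative software sketch (the class and flag names are invented); in the scheme described here the conversion is done in hardware.

```python
class SpecThread:
    """Minimal stand-in for a speculative thread context (illustrative only)."""
    def __init__(self):
        self.runahead_mode = False     # normal TLS execution by default
        self.can_be_squashed = True    # participates in TLS squashes
        self.can_cause_squash = True   # its violations squash others
        self.discard_on_finish = False # normally tries to commit

def convert_tls_clone_to_helper(t):
    """Turn a cloned TLS thread into a helper thread."""
    t.runahead_mode = True       # run ahead, value-predicting instead of stalling
    t.can_be_squashed = False    # suppress squashes against it
    t.can_cause_squash = False   # and squashes it would otherwise cause
    t.discard_on_finish = True   # results are never committed
    return t
```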
Intricacies to be Handled
HT may not prefetch effectively!
Dealing with contention:
–HT threads saturate BW much faster
Dealing with thread ordering:
–TLS imposes a total thread order
–Killing a HT would squash TLS threads
Creating and Terminating HTs
Create a HT on an L2 miss we can value-predict (VP)
–Use a memory-address-based confidence estimator
–VP only if confident
Create a HT only if we have a free processor
Only allow the most speculative thread to clone
–Seamless integration of HT with TLS
–BUT: if the parent is no longer the most speculative TLS thread, the HT has to be killed
Additionally kill a HT when:
–The parent or the HT thread finishes
–The HT causes an exception
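The spawn and kill conditions above can be condensed into two predicates. This is a hypothetical software sketch for clarity (the `Thread` structure and function names are invented; the actual decisions are made by hardware):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Thread:
    is_most_speculative: bool = False   # most speculative TLS thread?
    is_helper: bool = False             # already a helper thread?
    parent: Optional["Thread"] = None   # TLS thread this helper was cloned from

def should_spawn_helper(thread, miss_is_l2, vp_confident, free_core_available):
    """Spawn a helper only on a value-predictable L2 miss, only from the most
    speculative TLS thread, and only if a core is idle."""
    return (miss_is_l2 and vp_confident
            and thread.is_most_speculative
            and not thread.is_helper
            and free_core_available)

def should_kill_helper(helper, parent_finished, helper_finished, raised_exception):
    """Kill the helper if its parent is no longer the most speculative TLS
    thread, if either thread finishes, or if the helper raises an exception."""
    return (not helper.parent.is_most_speculative
            or parent_finished or helper_finished or raised_exception)
```

Restricting cloning to the most speculative thread is what keeps the total TLS thread order intact: the helper sits past the end of the order, so discarding it never forces a squash of committed-path TLS threads.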
Outline
Introduction
Speculative Multithreading Models
Performance Model
Unified Scheme
Experimental Setup and Results
Conclusions
Experimental Setup
Simulator, Compiler and Benchmarks:
–SESC simulator
–POSH compiler (Liu et al., PPoPP '06)
–SPEC 2000 Int
Architecture:
–Four-way CMP, 4-issue cores
–16KB L1 data (multi-versioned) and instruction caches
–1MB unified L2 cache
–Inst. window/ROB: 80/104 entries
–16KB last value predictor
Comparing TLS, Runahead and the Unified Scheme
Almost additive benefits
10.2% over TLS, 18.3% over RA
Understanding the Extra ILP
Improvements in ILP come from:
–Mainly memory
–Branch prediction (0.5% improvement)
Focus on memory:
–Miss rate on the committed path
–Clustering of misses (different cost)
Normalized Shared Cache Misses
All schemes better than sequential
Unified 41% better than sequential
Isolated vs. Clustered Misses
Both TLS and RA cluster misses like large-window machines
Unified does even better
Outline
Introduction
Speculative Multithreading Models
Performance Model
Unified Scheme
Experimental Setup and Results
Conclusions
Also in the Paper…
Dealing with the load of the system
Converting TLS threads to HTs
Multiple HTs
Effect of a better VP
Detailed comparison of the performance model against existing models (Renau et al., ICS '05)
Conclusions
CMPs are here to stay:
–What about single-threaded apps and apps with significant sequential sections?
Different apps require different SM techniques
–Even different phases within an app do
We propose the first mixed execution model
–TLS is nicely complemented by HT and RA
Our unified scheme outperforms existing SM models:
–TLS by 10.2% avg. (up to 41.2%)
–RA by 18.3% avg. (up to 35.2%)