Dynamic Region Selection for Thread Level Speculation
Presented by: Jeff Da Silva, Stanley Fung, Martin Labrecque
Feb 6, 2004

Slide 1: Dynamic Region Selection for Thread Level Speculation
Presented by: Jeff Da Silva, Stanley Fung, Martin Labrecque (Feb 6, 2004)
Builds on research done by: Chris Colohan from CMU and Greg Steffan

Slide 2: Multithreading on a Chip is here TODAY!
[Figure: threads of execution, from supercomputers down to desktops]
- Chip Multiprocessor (IBM Power4/5, SUN MAJC, Ultrasparc 4): multiple processor/cache pairs on one chip
- Simultaneous Multithreading (Alpha 21464, Intel Xeon, Pentium IV): multiple threads share one processor and cache
…but what can we do with them?

Slide 3: Improving Performance with a Chip Multiprocessor
[Figure: several independent applications, each running on its own processor/cache pair]
- With a bunch of independent applications, this improves throughput (total work per second)

Slide 4: Improving Performance with a Chip Multiprocessor
[Figure: a single application spread across all processor/cache pairs]
- With a single application, we need parallel threads to reduce execution time

Slide 5: Thread-Level Speculation: the Basic Idea
- Exploit the available thread-level parallelism
[Figure: iterations run speculatively in parallel; a store through *p in one thread conflicts with a speculative load through *q in a later thread, causing a violation, and the later thread recovers and re-executes]

Slide 6: Support for TLS: What Do We Need?
- Break programs into speculative threads, to maximize thread-level parallelism
- Track data dependences, to determine whether speculation was safe
- Recover from failed speculation, to ensure correct execution
These are the three key elements of every TLS system.
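To make the "track data dependences" element concrete, here is a toy software sketch of the check the TLS hardware performs. All names (`spec_thread_t`, `spec_read`, `violates`) are ours, not the talk's: a speculative thread records its exposed reads, and a write by a logically-earlier thread to any of those addresses means speculation was unsafe.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_READS 64

/* Illustrative record of one speculative thread's exposed reads. */
typedef struct {
    const void *reads[MAX_READS];
    int n;
} spec_thread_t;

/* The speculative thread records every address it reads. */
void spec_read(spec_thread_t *t, const void *addr) {
    if (t->n < MAX_READS)
        t->reads[t->n++] = addr;
}

/* When a logically-earlier thread writes addr, check whether a later
   speculative thread already read it: if so, speculation was unsafe
   and that thread must be squashed and re-executed. */
bool violates(const spec_thread_t *t, const void *addr) {
    for (int i = 0; i < t->n; i++)
        if (t->reads[i] == addr)
            return true;   /* read-before-write dependence violation */
    return false;
}
```

In a real TLS system this check is done by the cache coherence hardware, not a linear scan; the sketch only shows the logical condition being tested.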

Slide 7: Support for TLS: What Do We Need?
- Lots of research has been done on TLS hardware: tracking data dependences and recovering from violations
- We focus on how to select regions to run in parallel
  - A region is any segment of code that you want to speculatively parallelize
  - For this work, region == loop, and iterations == speculative threads

Slide 8: Why is static region selection hard?
- It requires extensive profiling information
- Regions can be nested — which loop should we parallelize?

    for ( i = 1 to N ) {        <= 2x faster in parallel
      ...
      for ( j = 1 to N ) {      <= 3x faster in parallel
        ...
        for ( k = 1 to N ) {    <= 4x faster in parallel
          ...
        }
      }
    }

- Programs have dynamic behaviour
- Dynamic Region Selection is a potential solution

Slide 9: Dynamic Region Selection
- The compiler transforms all candidate regions into parallel and sequential versions
- Through dynamic profiling, we decide which regions are to be run in parallel
- Key questions:
  - Is there any dynamic behaviour between region instances?
  - What is a good algorithm for selecting regions?
  - Are there performance trade-offs for doing dynamic profiling?
  - Is there any dynamic behaviour within region instances? (not the focus of this research)

Slide 10: Outline
- The role of the TLS compiler
- Characterizing dynamic behaviour
- Dynamic Region Selection (DRS) algorithms
- Results
- Conclusions
- Open questions and future work

Slide 11: Current Compilation for TLS
[Figure: the program's loop nest (LoopA through LoopH) shown as two trees, Sequential and Parallel; the choice of which version to run is made statically at compile time]

Slide 12: DRS Compilation
[Figure: the same loop nest (LoopA through LoopH), but each candidate region keeps both a sequential and a parallel version]

Slides 13-17: DRS Compilation (step by step)
1. Extract candidate region E
2. Create sequential and parallel versions of the region (clone)
3. Add some extra overhead to monitor the region's performance
4. Introduce a DRS algorithm to make the decision at runtime
(DRS compilation by Colohan)
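A minimal, runnable sketch of the code the DRS compiler might emit for a candidate region. Everything here is a hypothetical stand-in (the names `regionE_seq`, `regionE_par`, `drs_choose`, `drs_report`, and the trivial policy are ours, not the talk's output): both clones are kept, and a timing hook around the chosen clone feeds the DRS algorithm.

```c
#include <time.h>

enum { REGION_E = 0, N_REGIONS = 1 };

static double now(void) { return (double)clock() / CLOCKS_PER_SEC; }

/* Stand-ins for the compiler-generated clones; both compute the same
   result, with the "parallel" clone standing in for the TLS version. */
static int regionE_seq(int n) { int s = 0; for (int i = 0; i < n; i++) s += i; return s; }
static int regionE_par(int n) { return n * (n - 1) / 2; }

/* Trivial stand-in policy: sample the sequential version once, then
   always pick parallel. A real DRS algorithm would compare the times. */
static int instance_count[N_REGIONS];
static int drs_choose(int id) { return instance_count[id]++ > 0; }
static void drs_report(int id, double elapsed) { (void)id; (void)elapsed; }

/* The emitted region entry point: pick a version, run it, report its time. */
int regionE(int n) {
    double t0 = now();
    int result = drs_choose(REGION_E) ? regionE_par(n) : regionE_seq(n);
    drs_report(REGION_E, now() - t0);
    return result;
}
```

Whatever the policy decides, both clones must produce the same result; only the execution time differs.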

Slides 18-19: Characterizing TLS Region Behaviour
[Figures only; not captured in transcript]

Slide 20: DRS Algorithms
1. Sample Twice
2. Continuous Monitoring
3. Continuous Resample
4. Path-Sensitive Sampling

Slide 21: Sample Twice Algorithm
- Effective if behaviour is constant
- When a region is encountered:
  - 1st time: run the sequential version and record its execution time t1
  - 2nd time: run the parallel version (if possible) and record its execution time tp
  - Subsequent instances: if tp < t1, run the parallel version; else run the sequential version
- Note that by using execution time as the metric, we assume the amount of work done from instance to instance stays relatively constant. Using throughput (IPC) as the metric removes this assumption but adds complexity.
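The sample-twice policy above can be sketched in a few lines of C. The struct and function names are ours, not the talk's; times are in arbitrary units, and "unsampled" is encoded as a negative time.

```c
/* Minimal sketch of the sample-twice policy (illustrative names). */
typedef enum { RUN_SEQ, RUN_PAR } version_t;

typedef struct {
    double t1, tp;     /* sampled sequential / parallel times; <0 = unsampled */
    version_t last;    /* version dispatched for the current instance */
} region_stats_t;

void region_init(region_stats_t *r) { r->t1 = r->tp = -1.0; r->last = RUN_SEQ; }

/* Decide which version to run for the next instance of the region. */
version_t sample_twice_next(region_stats_t *r) {
    if (r->t1 < 0)      r->last = RUN_SEQ;                         /* 1st: sample sequential */
    else if (r->tp < 0) r->last = RUN_PAR;                         /* 2nd: sample parallel */
    else r->last = (r->tp < r->t1) ? RUN_PAR : RUN_SEQ;            /* decided thereafter */
    return r->last;
}

/* Record the execution time of the instance that just finished;
   sample twice keeps only the first sample of each version. */
void sample_twice_report(region_stats_t *r, double t) {
    if (r->last == RUN_SEQ && r->t1 < 0)      r->t1 = t;
    else if (r->last == RUN_PAR && r->tp < 0) r->tp = t;
}
```

Once both samples exist, the decision never changes, which is exactly why this policy only works when behaviour is constant.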

Slide 22: Sample Twice Example
[Figure: a region's instances move through the states Sample Sequential → Sample Parallel → Decided]

Slide 23: Continuous Monitoring
- An extension of the sample-twice method: continuously monitor all regions and re-evaluate the decision if the speedup changes
  - Since it does little more than keep monitoring, the extra overhead is essentially free
- When a region is encountered:
  - 1st time: run the sequential version and record its execution time t1
  - 2nd time: run the parallel version (if possible) and record its execution time tp
  - Subsequent instances: if tp < t1, run the parallel version and update tp; else run the sequential version and update t1
- Effective if behaviour is continuously degrading

Slide 24: Continuous Monitoring Example
[Figure: a region's instances move through Sample Sequential → Sample Parallel → Decided, with the samples updated each instance:
(t1=NA, tp=NA) → (t1=5, tp=NA) → (t1=5, tp=3) → (t1=5, tp=4) → (t1=5, tp=6) → (t1=4, tp=6)]

Slide 25: Continuous Resample
- Effective if behaviour is continuously changing
- Continuously resample by periodically flushing the recorded values t1 and tp
- Adds new overhead
- This algorithm has not yet been explored

Slide 26: Path-Sensitive Sampling
- If the behaviour is periodic, a means of filtering is required
- One intuitive solution is to resample when the invocation path, or region nesting path, changes

Slide 27: Path-Sensitive Sampling
- Sample when the region nesting path changes
  - Assumes behaviour stays the same as long as the invocation path does not change

    void foo() { while(cond) moo(); }
    void bar() { while(cond) moo(); }
    void moo() { while(cond) moo(); }

[Figure: nesting paths foo_while → moo_while and bar_while → moo_while]
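One way to realize this, sketched under our own assumptions (the table, names, and path encoding are illustrative, not the talk's implementation): keep an independent (t1, tp) sample pair per region-nesting path, so moo_while entered via foo_while and via bar_while are sampled and decided separately, and a new path starts fresh sampling.

```c
#include <string.h>

#define MAX_PATHS 16   /* sketch assumes this capacity is never exceeded */

/* Per-path sample slot; path strings like "foo_while>moo_while" are
   a hypothetical encoding of the region nesting path. */
typedef struct {
    const char *path;
    double t1, tp;     /* <0 = not yet sampled on this path */
} path_stats_t;

static path_stats_t table[MAX_PATHS];
static int n_paths = 0;

/* Find or create the sample slot for the current nesting path. */
path_stats_t *stats_for_path(const char *path) {
    for (int i = 0; i < n_paths; i++)
        if (strcmp(table[i].path, path) == 0)
            return &table[i];
    table[n_paths].path = path;
    table[n_paths].t1 = table[n_paths].tp = -1.0;   /* fresh sampling */
    return &table[n_paths++];
}
```

Any of the earlier policies (sample twice, continuous monitoring) can then run per slot instead of per region.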

Slide 28: Results – Static Analysis
[Figure: average number of per-path instances for all regions]

Slide 29: Interesting Region in IJPEG
[Figure: number of speculative threads per region instance, over the program's execution]

Slide 30: Interesting Region in Perl
[Figure: number of instructions per region instance, over the program's execution]

Slide 31: Experimental Framework
- SPEC benchmarks
- TLS compiler
- MIPS architecture
- TLS profiler and simulator

Slide 32: Outline
- The role of the TLS compiler
- Characterizing dynamic behaviour
- Dynamic Region Selection (DRS) algorithms
- Results
- Conclusions
- Open questions and future work

Slide 33: Is there any dynamic behaviour between region instances?

Slide 34: Results – Dynamic Behaviour
- Regions with high coverage have low instruction-count variance between instances

Slide 35: Results – Dynamic Behaviour
- Regions with high coverage have low violation variance between instances

Slide 36: Results – Dynamic Behaviour
- Regions with high coverage have low variance in speculative thread count between instances

Slide 37: What is a good algorithm for selecting regions?

Slide 38:
[Figure: speedup of each algorithm relative to the static 'optimal' selection (faster/slower)]
- Continuous monitoring is 1% better on average than sample twice
- About 10% worse than the static 'optimal' selection

Slide 39: How often did we agree with the 'optimal' selection?

Slide 40:
[Figure: fraction of regions on which each algorithm agrees with the static 'optimal' selection]
- Sample twice agrees 57% of the time, on average
- Continuous monitoring agrees 43% of the time, on average
- The levels of agreement are close → no dynamic behaviour?

Slide 41:
- Does agreeing with the static 'optimal' give better performance?
- Another sign of no dynamic behaviour?

Slide 42:
- Sample twice often leaves regions undecided
- Overall, undecided regions represent low coverage

Slide 43: Outline
- The role of the TLS compiler
- Characterizing dynamic behaviour
- Dynamic Region Selection (DRS) algorithms
- Results
- Conclusions
- Open questions and future work

Slide 44: Conclusions
- This is an unexplored research topic (as far as we know)
- Is there any dynamic behaviour between region instances?
  - We have good indications that there isn't much of it
- What is the best algorithm for selecting regions?
  - Continuous monitoring does 1% better than sample twice
  - It comes within 10% of the static 'optimal', without any offline profiling!
- Are there performance trade-offs for doing dynamic profiling?
  - Code size increases by at most 30%
  - The runtime performance overhead is believed to be negligible
- Is there any dynamic behaviour within a region instance?
  - We don't know yet

Slide 45: Open Questions
- The dynamic optimal is the theoretical optimal
  - How close are we to the dynamic optimal?
  - How close is the static 'optimal' to the dynamic optimal?
- How do the other proposed algorithms perform?
- What should be implemented in hardware, and what in software?

Slide 46: Questions?

Slide 47: AUXILIARY SLIDES

Slide 48: Results – Potential Study
[Figure: execution time versus invocation (IJPEG)]

Slide 49: Results – Potential Study
[Figure: execution time versus invocation (CRAFTY)]

Slide 50: Results – Potential Study
[Figure: execution time versus invocation (LI)]

Slide 51: Results – Potential Study
[Figure: execution time versus invocation (PERL)]

Slide 52: Results – Static Analysis
[Figure only; not captured in transcript]

Slides 53-55: Results – Dynamic Behaviour
[Figures only; not captured in transcript]