Heuristics for Profile-driven Method-level Speculative Parallelization
John Whaley and Christos Kozyrakis, Stanford University
June 15, 2005

Presentation transcript:

Title slide: Heuristics for Profile-driven Method-level Speculative Parallelization. John Whaley and Christos Kozyrakis, Stanford University. June 15, 2005.

Slide 1: Speculative Multithreading
Speculatively parallelize an application:
–Uses speculation to overcome ambiguous dependencies
–Uses hardware support to recover from misspeculation
–Promising technique for automatically extracting parallelism from programs
Problem: Where to put the threads?

Slide 2: Method-Level Speculation
Idea: Use method boundaries as speculative threads
–Computation is naturally partitioned into methods
–Execution is often independent
–Well-defined interface
Extract parallelism from irregular, non-numerical applications

Slide 3: Method-Level Speculation Example
main() {
  work_A;
  foo();
  work_C;  // reads *q
}
foo() {
  work_B;  // writes *p
}

Slide 4: Method-Level Speculation Example
main() {
  work_A;
  foo() {
    work_B;  // writes *p
  }
  work_C;  // reads *q
}

Slide 5: Method-Level Speculation Example
Same code as slide 4, with a timeline of sequential execution: work_A, then foo() (work_B), then work_C, all on one thread.

Slide 6: Method-Level Speculation Example
TLS execution, no violation: at the call to foo(), a speculative thread is forked (paying fork overhead) to run work_C while the main thread runs work_B inside foo(). Since p != q, the store to *p does not conflict with the read of *q, so there is no violation and the two pieces of work overlap.

Slide 7: Method-Level Speculation Example
TLS execution, violation: the speculative thread is forked the same way, but here p == q, so the store to *p in work_B conflicts with the read of *q in work_C. The speculative work_C is aborted and re-executed after foo() completes, paying fork and violation overhead on top of the sequential work.

Slide 8: Method-Level Speculation Example
Side-by-side timelines of the three cases: sequential execution, TLS with no violation (work_B and work_C overlap, so the program finishes earlier despite the fork overhead), and TLS with a violation (work_C is aborted and re-executed, so speculation ends up slower than sequential).
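To make the trade-off on slides 6 to 8 concrete, here is a minimal cost-model sketch (mine, not from the talk). The cycle counts for work_B and work_C are made up; the 70-cycle fork/retire/squash cost is the figure quoted later for the simulated architecture (slide 29).

```java
// Minimal sketch (not from the talk): a back-of-the-envelope model of when
// method-level speculation pays off. All cycle counts are hypothetical.
class SpeculationCostModel {

    // Sequential time for the overlappable region: callee body, then continuation.
    static long sequential(long workB, long workC) {
        return workB + workC;
    }

    // Speculative time with no violation: work_B and work_C overlap,
    // plus a fixed fork/retire overhead.
    static long tlsNoViolation(long workB, long workC, long overhead) {
        return Math.max(workB, workC) + overhead;
    }

    // Speculative time on a violation: work_C is aborted and re-executed
    // after work_B, so the overhead is paid with no overlap benefit.
    static long tlsViolation(long workB, long workC, long overhead) {
        return workB + workC + 2 * overhead;
    }

    public static void main(String[] args) {
        long overhead = 70;               // fork/retire/squash cost from slide 29
        long workB = 1_000, workC = 900;  // hypothetical cycle counts
        System.out.println("sequential   = " + sequential(workB, workC));                // 1900
        System.out.println("no violation = " + tlsNoViolation(workB, workC, overhead));  // 1070
        System.out.println("violation    = " + tlsViolation(workB, workC, overhead));    // 2040
    }
}
```

With these numbers speculation wins only in the no-violation case, which is exactly the balance the heuristics in this talk try to predict.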

Slide 9: Nested Speculation
main() {
  foo() {
    work_A;
  }
  work_B;
  bar() {
    work_C;
  }
  work_D;
}
Sequences of method calls can cause nested speculation: the thread forked at the call to foo() can itself fork another speculative thread at the call to bar().

Slide 10: This Talk: Choosing Speculation Points
Which methods should we speculate on?
–Low chance of violation
–Not too short, not too long
–Not too many stores
Idea: Use profile data to choose good speculation points
–Usable in a profile-driven or dynamic compiler
–Should be low-cost but accurate
We evaluated 7 different heuristics
–~80% effective compared to a perfect oracle

Slide 11: Difficulties in Method-Level Speculation
Method invocations can have varying execution times
–Too short: doesn't overcome the speculation overhead
–Too long: more likely to violate or overflow, and prevents other threads from retiring
Return values
–A mispredicted return value causes a violation

Slide 12: Classes of Heuristics
Simple Heuristics
–Use only simple information, such as method runtime
Single-Pass Heuristics
–More advanced information, such as the sequence of store addresses
–Single pass through the profile data
Multi-Pass Heuristics
–Multiple passes through the profile data

Slide 13: Classes of Heuristics (outline repeated; the next slides cover the Simple Heuristics)

Slide 14: Runtime Heuristic (SI-RT)
Speculate on all methods with:
–MIN < runtime < MAX
Idea: the method should run long enough to amortize the speculation overhead, but not so long that it is likely to violate
Data required:
–Average runtime of each method

Slide 15: Store Heuristic (SI-SC)
Speculate on all methods with:
–dynamic # of stores < MAX
Idea: stores cause violations, so speculate on methods with few stores
Data required:
–Average dynamic store count of each method
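As a reading aid (not the authors' code), the two simple heuristics boil down to threshold tests over per-method profile averages. The class and field names below are hypothetical; the thresholds are the tuned values reported later in the talk (slide 42).

```java
// Minimal sketch of the two simple heuristics as predicates over
// per-method profile data. Thresholds are the tuned values from slide 42.
class SimpleHeuristics {
    static final long MIN_CYCLES = 1_000;        // 10^3 cycles
    static final long MAX_CYCLES = 10_000_000;   // 10^7 cycles
    static final long MAX_STORES = 100_000;      // 10^5 stores

    // SI-RT: speculate when the average runtime falls between MIN and MAX.
    static boolean siRT(double avgRuntimeCycles) {
        return avgRuntimeCycles > MIN_CYCLES && avgRuntimeCycles < MAX_CYCLES;
    }

    // SI-SC: speculate when the average dynamic store count is small.
    static boolean siSC(double avgDynamicStores) {
        return avgDynamicStores < MAX_STORES;
    }
}
```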

Slide 16: Classes of Heuristics (outline repeated; the next slides cover the Single-Pass Heuristics)

Slide 17: Stalled Threads
foo() {
  bar() {
    work_A;
  }
  work_B;
}
Speculative threads may stall while waiting to become the main thread: the thread forked at the call to bar() finishes work_B before work_A inside bar() completes, and then sits idle.

Slide 18: Fork at Intermediate Points
Same code as slide 17, but the speculative thread is forked at an intermediate point within the method rather than at the call, to avoid violations and stalling.

Slide 19: Best Speedup Heuristic (SP-SU)
Speculate on methods with:
–predicted speedup > THRES
Calculate the predicted speedup as:
–predicted speedup = expected sequential run time / expected parallel run time
Scan the store stream backwards to find the fork point
–Choose the fork point to avoid violations and stalling
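A sketch of how SP-SU could be computed from profile data, under my own simplifying assumptions: the MethodProfile fields and the way the fork point is summarized (cycles from the last conflicting store to the method's return) stand in for the store-stream scan described on this slide.

```java
// Minimal sketch (not the authors' implementation) of the SP-SU idea:
// estimate a method's speedup from profiled cycle counts and pick it when
// the estimate clears a threshold.
class BestSpeedupHeuristic {
    static final double THRES = 1.2;   // tuned single-pass threshold (slide 42)

    // Profile summary for one method call site (hypothetical shape).
    static class MethodProfile {
        long callerCyclesAfterCall;   // work the speculative thread would run
        long calleeCycles;            // work the callee runs before returning
        long forkOverhead;            // fork + retire cost, e.g. 70 + 70 cycles
        long cyclesFromLastConflictingStoreToReturn; // found by scanning the
                                                     // store stream backwards
    }

    static double predictedSpeedup(MethodProfile p) {
        double sequential = p.calleeCycles + p.callerCyclesAfterCall;
        // Fork only after the last store that would conflict with the
        // continuation, so the overlapped region is violation-free.
        double overlappable = Math.min(p.cyclesFromLastConflictingStoreToReturn,
                                       p.callerCyclesAfterCall);
        double parallel = sequential - overlappable + p.forkOverhead;
        return sequential / parallel;
    }

    static boolean speculate(MethodProfile p) {
        return predictedSpeedup(p) > THRES;
    }
}
```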

Slide 20: Most Cycles Saved Heuristic (SP-CS)
Speculate on methods with:
–predicted cycle savings > THRES
Calculate the predicted cycle savings as:
–predicted cycle savings = sequential cycle count − parallel cycle count
Place the fork point such that:
–predicted probability of violation < RATIO
Uses the same information as SP-SU
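SP-CS uses the same inputs but ranks by absolute savings and bounds the violation probability. A matching sketch, reusing the hypothetical MethodProfile from the SP-SU sketch above; THRES and RATIO are the tuned values from slide 42.

```java
// Minimal sketch (again not the authors' code): SP-CS ranks by absolute
// cycles saved instead of speedup ratio, and only accepts fork points whose
// estimated violation probability stays under RATIO.
class MostCyclesSavedHeuristic {
    static final long THRES_CYCLES = 100_000;  // 10^5 cycles saved (slide 42)
    static final double RATIO = 0.70;          // max tolerated violation probability

    static boolean speculate(BestSpeedupHeuristic.MethodProfile p,
                             double estimatedViolationProbability) {
        double sequential = p.calleeCycles + p.callerCyclesAfterCall;
        double overlappable = Math.min(p.cyclesFromLastConflictingStoreToReturn,
                                       p.callerCyclesAfterCall);
        double parallel = sequential - overlappable + p.forkOverhead;
        double cyclesSaved = sequential - parallel;
        return cyclesSaved > THRES_CYCLES
            && estimatedViolationProbability < RATIO;
    }
}
```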

Slide 21: Classes of Heuristics (outline repeated; the next slides cover the Multi-Pass Heuristics)

Slide 22: Nested Speculation
main() {
  foo() {
    work_A;
    bar() {
      work_B;
    }
    work_C;
  }
  work_D;
}
(The timeline shows nested forks at the calls to foo() and bar(), with one speculative thread left idle.)
Effectiveness of a speculation choice depends on the choices for caller methods!

Slide 23: Best Speedup Heuristic with Parent Info (MP-SU)
Iterative algorithm:
–Choose the speculation with the best speedup
–Readjust all callee methods to account for speculation in the caller
–Repeat until the best speedup < THRES
Maximum number of iterations: the depth of the call graph
(A sketch of this iterative selection loop appears after slide 25 below.)

Slide 24: Most Cycles Saved Heuristic with Parent Info (MP-CS)
Iterative algorithm:
1. Choose the speculation with the most cycles saved and predicted violations < RATIO
2. Readjust all callee methods to account for speculation in the caller
3. Repeat until the most cycles saved < THRES
Multi-pass version of SP-CS

Slide 25: Most Cycles Saved Heuristic with No Nesting (MP-CSNN)
Iterative algorithm:
–Choose the speculation with the most cycles saved and predicted violations < RATIO
–Eliminate all callee methods from consideration
–Repeat until the most cycles saved < THRES
Disallows nested speculation to avoid double-counting the benefits
Faster to compute than MP-CS
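The three multi-pass heuristics share one greedy loop, differing only in what happens to a chosen method's callees. Below is a sketch of that loop under my own assumptions; the Candidate type and the readjustment step are placeholders, not the talk's implementation.

```java
// Minimal sketch of the multi-pass selection loop shared by MP-SU, MP-CS,
// and MP-CSNN: greedily pick the best remaining candidate, then either
// readjust its callees' estimates (MP-SU, MP-CS) or drop them (MP-CSNN).
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class MultiPassSelection {
    static class Candidate {
        String method;
        double benefit;            // predicted speedup or cycles saved
        List<Candidate> callees = new ArrayList<>();
    }

    static List<Candidate> select(List<Candidate> candidates,
                                  double thres,
                                  boolean allowNesting) {
        List<Candidate> chosen = new ArrayList<>();
        List<Candidate> remaining = new ArrayList<>(candidates);
        while (!remaining.isEmpty()) {
            Candidate best = remaining.stream()
                    .max(Comparator.comparingDouble(c -> c.benefit))
                    .orElseThrow();
            if (best.benefit < thres) break;      // repeat until benefit < THRES
            chosen.add(best);
            remaining.remove(best);
            if (allowNesting) {
                // MP-SU / MP-CS: callees stay candidates, but their benefit must
                // be re-estimated assuming the caller now speculates (stubbed).
                for (Candidate callee : best.callees) {
                    callee.benefit = readjustForSpeculatingCaller(callee);
                }
            } else {
                // MP-CSNN: callees are removed so benefits are never double-counted.
                remaining.removeAll(best.callees);
            }
        }
        return chosen;
    }

    // Placeholder: in the real system this would re-evaluate the callee's
    // profile under the assumption that its caller already speculates.
    static double readjustForSpeculatingCaller(Candidate callee) {
        return callee.benefit;  // no-op stub for the sketch
    }
}
```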

Slide 26: Experimental Results

Slide 27: Trace-Driven Simulation
How do we find the optimal parameters (THRES, RATIO, etc.)? Parameter sweeps:
–For each benchmark, for each heuristic, sweep multiple parameter values per heuristic
With cycle-accurate simulation this would take over 100 CPU-years!
Alternative: trace-driven simulation

Slide 28: Trace-Driven Simulation
1. Collect a trace on a Pentium III (3-way out-of-order CPU, 32K L1, 256K L2)
–Record all memory accesses, method enter/exit events, etc.
2. Recalibrate to remove the instrumentation overhead
3. Simulate the trace on 4-way CMP hardware
–Model the shared cache, speculation overheads, dependencies, squashing, etc.
Spot check against a cycle-accurate simulator: accurate to within ~3%
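For intuition (this is my sketch, not the authors' infrastructure), the trace events implied by this methodology can be aggregated into exactly the per-method runtime and store-count profiles that the simple heuristics consume. Event and field names are hypothetical.

```java
// Minimal sketch: aggregate trace events into per-method runtime and
// dynamic store counts. Stores are attributed to the innermost active method.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TraceProfiler {
    enum Kind { LOAD, STORE, METHOD_ENTER, METHOD_EXIT }
    record Event(Kind kind, String method, long address, long cycle) {}

    // Returns, per method: {total inclusive cycles, dynamic stores, invocations}.
    static Map<String, long[]> aggregate(List<Event> trace) {
        Map<String, long[]> stats = new HashMap<>();
        Deque<Event> callStack = new ArrayDeque<>();
        for (Event e : trace) {
            switch (e.kind()) {
                case METHOD_ENTER -> callStack.push(e);
                case METHOD_EXIT -> {
                    Event enter = callStack.pop();
                    long[] s = stats.computeIfAbsent(e.method(), k -> new long[3]);
                    s[0] += e.cycle() - enter.cycle();  // inclusive runtime
                    s[2] += 1;                          // invocation count
                }
                case STORE -> {
                    if (!callStack.isEmpty()) {
                        stats.computeIfAbsent(callStack.peek().method(),
                                              k -> new long[3])[1] += 1;
                    }
                }
                case LOAD -> { /* not needed for these two heuristics */ }
            }
        }
        return stats;
    }
}
```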

Slide 29: Simulated Architecture
Four 3-way out-of-order CPUs
–32K L1, 256K shared L2
Single speculative buffer per CPU
Forking, retiring, and squashing overhead: 70 cycles each
Speculative threads can be preempted
–Low-priority speculations can be squashed by higher-priority ones

Slide 30: The Oracle
A "Perfect" Oracle:
–Preanalyzes the entire trace
–Makes a separate decision on every method invocation
–Chooses fork points so that it never violates
–Zero overhead for forking or retiring threads
An upper bound on the performance of any heuristic

Slide 31: Benchmarks
SpecJVM
–compress: Lempel-Ziv compression
–jack: Java parser generator
–javac: Java compiler from the JDK
–jess: Java expert shell system
–mpeg: MPEG Layer 3 audio decompression
–raytrace: raytracer that works on a dinosaur scene
SPLASH-2
–barnes: hierarchical N-body solver
–water: simulation of water molecules

Slides 32–41: Heuristic Parameter Tuning
(Ten chart slides with parameter-sweep results for the individual heuristics; the transcript contains no text beyond the slide title.)

Slide 42: Tuning Summary
Runtime (SI-RT):
–MIN = 10^3 cycles, MAX = 10^7 cycles
Store (SI-SC):
–MAX = 10^5 stores
Best speedup (SP-SU, MP-SU):
–Single pass: MIN = 1.2x speedup
–Multi pass: MIN = 1.4x speedup
Most cycles saved (SP-CS, MP-CS, MP-CSNN):
–THRES = 10^5 cycles saved, RATIO = 70% violation
Return value prediction:
–Constant prediction is within 15% of perfect value prediction

Slide 43: Overall Speedups

Slide 44: Breakdown of Speculative Threads

Slide 45: Breakdown of Execution Time

Slide 46: Speculative Store Buffer Size
(Table of buffer sizes per benchmark (barnes, compress, jack, javac, jess, mpeg, raytrace, water) for each heuristic (SI-RT, SI-SC, SP-SU, SP-CS, MP-SU, MP-CS, MP-CSNN); the numeric entries did not survive in the transcript.)
Maximum speculative store buffer size: 16KB

Slide 47: Related Work
Loop-level parallelism
Method-level parallelism
–Warg and Stenstrom
  –ICPAC'01: limit study
  –IPDPS'03: heuristic based on runtime
  –CF'05: misspeculation prediction
Compilers
–Multiscalar: Vijaykumar and Sohi, JPDC'99
–SpMT: Bhowmik and Chen, SPAA'02

Slide 48: Conclusions
Evaluated 7 heuristics for method-level speculation
Take-home points:
–Method-level speculation has complex interactions and is very hard to predict
–Single-pass heuristics do a good job: 80% of a perfect oracle
–The most important issue is the balance between over- and under-speculating