Lengthening Traces to Improve Opportunities for Dynamic Optimization Chuck Zhao, Cristiana Amza, Greg Steffan, University of Toronto Youfeng Wu Intel Research.


1 Lengthening Traces to Improve Opportunities for Dynamic Optimization
Chuck Zhao, Cristiana Amza, Greg Steffan (University of Toronto); Youfeng Wu (Intel Research)
Feb. 16, 2007, Interact-12, HPCA

2 Intel's StarDBT Project
- StarDBT: a dynamic binary translation (DBT) framework
  - Operates on traces and optimizes hot traces
- Long-term goal: use StarDBT to let legacy apps exploit transactional memory (TM) support
  - Not by automatically parallelizing legacy apps
  - Allow speculative sequential optimizations
  - Use hardware TM's checkpoint/restore
- Problem: default traces are too small
  - TM overheads would overwhelm the benefits
- Challenge: lengthening traces can be tricky

3 Trace Formation
[Figure: a basic-block profile over blocks A to G beside the corresponding trace profile; on-trace blocks are A, B, D, F, G, and off-trace stubs lead to C and E]
- Control flow that goes off-trace can be costly

4 Trade-offs when Lengthening Traces
[Figure: trace A B D F G with 5% side exits; the original trace has a completion ratio of 100% - 10% = 90%, and the lengthened trace 100% - 25% = 75%]
- Trade-offs:
  - Longer traces have more optimization opportunities
  - Longer traces have more side-exit branches
- Completion ratio:
  - The likelihood of execution staying on trace
  - The percentage of executions reaching the trace tail (100% minus the side-exit ratio)
- A sweet spot exists in between. Can we find it?

5 Our Work So Far (i.e., this talk)
1. Lengthening traces while maintaining completion ratios
  - Through unrolling and straightening
  - A characterization of the impact on traces: length, completion ratio, unroll factor, …
2. Improving optimization opportunities on longer traces
  - Improve Local Value Numbering (LVN) hits
  - Measurement of the impact on performance is pending
3. Performing on-the-fly actions in the DBT system
  - Decisions are made by instrumenting/sampling code online

6 Related Work
- Binary translation systems
  - Dynamo, DynamoRIO, PIN
  - StarDBT: transparent translation of x86 legacy code
- Trace collection and optimizations
  - Java JIT
  - Dynamo, DynamoRIO, Mojo
  - StarDBT: x86 binary level, MRET² to improve trace formation, aggressive trace optimizations
- This work: first full analysis of trace-lengthening issues for DBT systems

7 StarDBT Trace Types
[Figure: traces a, b, c, d and the dispatcher, illustrating three trace types: self type, other-trace type, and elsewhere type]

8 Lengthening Traces Through Unrolling
- Unrolling increases a trace's length but reduces its completion ratio
[Figure: loop trace a and its unrolled copies; completion ratio: 90% (a), 81% (aa), 72.9% (aaa)]

9 Finding the Sweet-Spot Unroll Factor
Given p_orig = 99% and p_target = 90%:

  Unroll factor    Completion ratio
  1                p (0.99)
  2                p^2 (0.98)
  3                p^3 (0.97)
  …                …
  N (10)           p^10 (0.904)
  N (11)           p^11 (0.895)

- Unrolling 10 times stays above the 90% target; unrolling 11 times falls below it
- Traces with 100% completion ratio: set N = 10, chosen by the system designer
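
The table on this slide can be reproduced with a short calculation. The sketch below is only illustrative: it assumes, as the slide's p^N column does, that each unrolled copy completes independently with probability p_orig. The function name `sweet_spot_unroll_factor` and the `max_unroll` parameter are hypothetical; the default cap of 10 mirrors the "set N = 10" note for traces that never exit early.

```python
def sweet_spot_unroll_factor(p_orig, p_target, max_unroll=10):
    """Return the largest N such that p_orig**N >= p_target.

    Assumption: the completion ratio of a trace unrolled N times is
    p_orig**N (the model used in the slide's table). The result is
    capped at max_unroll, which also covers p_orig == 1.0, where the
    completion ratio never drops. An unroll factor of 1 means the
    trace is left as-is.
    """
    if not (0.0 < p_orig <= 1.0 and 0.0 < p_target <= 1.0):
        raise ValueError("ratios must be in (0, 1]")
    n = 1
    # Keep unrolling while one more copy still meets the target.
    while n < max_unroll and p_orig ** (n + 1) >= p_target:
        n += 1
    return n
```

With p_orig = 0.99, p_target = 0.90 and a generous cap such as `max_unroll=64`, the loop stops at N = 10, because 0.99^11 is about 0.895 and falls below the target.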

10 Lengthening Traces Through Straightening
[Figure: traces b, c and d, before and after straightening]
- We do not yet implement or evaluate straightening

11 Evaluation

12 Distribution of Original Completion Ratios
[Figure: distribution of hot traces by original completion ratio]
- The majority of hot traces have completion ratios in the 90%-100% range

13 Impact of Unrolling on Hot Trace Size
[Figure: average number of instructions per hot trace versus target completion ratio, annotated "36% longer"]
- Lengthening increases hot trace size by more than 36%
- Selected SPECint CPU2000 benchmarks with MinneSPEC inputs

14 How Much are Traces Unrolled?
[Figure: average unroll factor versus target completion ratio, with a "not unrolled" baseline]
- Hot traces are unrolled on average by 1.38x or more

15 Average Completion Ratio After Lengthening
[Figure: completion ratio after lengthening; axis ticks from 10% to 90%; the drop is annotated "<0.5%"]
- Lengthening traces reduces the completion ratio by less than 0.5%

16 Impact of Lengthening on Optimizations

17 Local Value Numbering (LVN)
- No need to build a Control Flow Graph (CFG)
  - A trace carries only partial CFG information
- No need to perform Data Flow Analysis (DFA)
  - DFA is expensive and relies on the CFG
- Can be arranged into a single-pass scan
  - Ease of implementation
  - Relatively lightweight algorithm
- Performs three optimizations:
  - Common Subexpression Elimination (CSE)
  - Copy Propagation (CP)
  - Dead-Code Elimination (DCE)
- LVN is common in JIT optimizers

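18 Ex: LVN On a Lengthened Trace
Subscripts (written _n) are value numbers.

  Original traces          Lengthened trace                   Optimized trace
  …                        …                                  …
  c = a + b                c_3 = a_1 + b_2                    c = a + b
  d = a                    d_1 = a_1          (DCE hit)
  e = b                    e_2 = b_2                          e = b
  …                        f_3 = d_1 + e_2
  f = d + e                  => f_3 = c_3     (CSE hit)       f = c
  d = x                    d_4 = x_4                          d = x
  …                        …                                  …

- CSE hit: d_1 + e_2 has the same value number as a_1 + b_2, so f = d + e becomes f = c
- DCE hit: d = a is overwritten by d = x with no remaining use in between, so it is removed
- Neither hit is visible while the two original traces are optimized separately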
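
The example on this slide can be traced with a small value-numbering pass. This is a minimal sketch under stated assumptions, not StarDBT's implementation: instructions are hypothetical `(dest, op, args)` tuples, `"copy"` stands for a plain assignment, only `"+"` is treated as commutative, there are no side exits or memory effects, and every variable is assumed live at the end of the trace. Copy propagation is omitted for brevity.

```python
from itertools import count

def lvn(trace):
    """Single-pass local value numbering over a straight-line trace.

    Returns (optimized_trace, cse_hits, dce_hits).
    """
    fresh = count(1)
    var2vn = {}    # variable -> value number it currently holds
    expr2vn = {}   # (op, operand value numbers) -> value number of the result

    def vn_of(var):
        if var not in var2vn:
            var2vn[var] = next(fresh)
        return var2vn[var]

    # Forward scan: number values and replace recomputations by copies (CSE).
    rewritten, cse_hits = [], 0
    for dest, op, args in trace:
        arg_vns = [vn_of(a) for a in args]
        if op == "copy":
            vn = arg_vns[0]
            rewritten.append((dest, op, args))
        else:
            key = (op, *sorted(arg_vns)) if op == "+" else (op, *arg_vns)
            vn = expr2vn.get(key)
            # A variable that still holds this value, if any.
            holder = next((v for v, n in var2vn.items() if n == vn), None)
            if holder is not None:
                rewritten.append((dest, "copy", (holder,)))
                cse_hits += 1
            else:
                if vn is None:
                    vn = expr2vn[key] = next(fresh)
                rewritten.append((dest, op, args))
        var2vn[dest] = vn

    # Backward scan: drop a definition that is overwritten before any read (DCE).
    overwritten, kept, dce_hits = set(), [], 0
    for dest, op, args in reversed(rewritten):
        if dest in overwritten:
            dce_hits += 1
            continue
        overwritten.add(dest)
        overwritten.difference_update(args)  # operands are read here
        kept.append((dest, op, args))
    kept.reverse()
    return kept, cse_hits, dce_hits
```

Feeding it the lengthened trace from the slide (c = a + b, d = a, e = b, f = d + e, d = x) yields c = a + b, e = b, f = c, d = x with one CSE hit and one DCE hit, which matches the optimized trace shown above.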

19 LVN Hits Improvement (%)
[Figure: % increase in LVN hits (axis from 5% to 35%) versus target completion ratio]
- 10+% more LVN hits are available through lengthening

20 Ongoing Work
- Complete the DBT optimization framework
- Evaluate speculative optimizations on long hot traces with high completion ratios
- Automatically determine the optimal transaction granularity
- Use HTM to support trace-based speculative optimizations

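21 Control Speculation
[Figure: after cmp …, one path is taken 10-% of the time; the 90+% path contains ld x = [y]]
Speculative version:
    ld.s x = [y]
    if (c) {
        chk.s x, recovery
    next:
        …
    }
    recovery:
        ld x = [y]
        jmp next
Source: "A Compiler Framework for Speculative Analysis and Optimizations", Lin et al., PLDI '03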

22 Use HTM to Support Trace-based Speculative Optimizations
[Figure: after cmp …, one path is taken 10-% of the time; the 90+% path contains ld x = [y]]
Transactional version:
    start_tx
    ld x = [y]
    if (c) {
        chk x, abort_tx
        …
    }
    commit_tx
- Use longer traces with high completion ratios as the transaction (tx) granularity
- HTM hardware support simplifies speculative optimization

23 Conclusion
- Traces can be effectively lengthened
  - Trace size increases by 36+%
  - Completion ratio decreases by less than 0.5%
- Longer traces provide better opportunities for optimization
  - LVN hits increase by 10+%

24 Q + A

25 Complete StarDBT Optimization Framework
- x86 is a CISC ISA
  - Code patching won't work
  - Really need a code generator and an IR
- Design and implement a low-level runtime IR
  - Close to hardware
  - Captures and represents all necessary low-level info
  - Easy to convert from/to machine code
  - Easy to implement analyses and optimizations on
- Starting points
  - Dynamo IR
  - LLVM IR
  - GCC RTL
  - …

26 StarDBT Overall Structure

27 Trace Formation Heuristics
- MRET: Most Recently Executed Tail
  - Originally proposed by Dynamo
  - Trace head: a loop head (backward branch target) whose sampling counter reaches a certain threshold
  - Trace tail: satisfies certain trace-tail conditions
- MRET²: 2-pass MRET
  - Perform 2 independent MRET trace formations
  - Intersect the traces that share a common head

28 Traces and Hot Traces
- Trace
  - MRET² recognizes trace heads
  - Trace tails satisfy certain conditions
  - The blocks in between become a trace
- Hot trace
  - Based on recognized traces
  - Put in additional software counters
    - Head: head counter
    - Each early-exit branch: off-trace counter
  - Sampling yields the hot trace's completion ratio
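
One plausible way to turn the counters on this slide into a completion ratio is sketched below. It is an assumption, not something the slides spell out: the head counter is taken as the number of sampled trace entries and each off-trace counter as the number of early exits through one side-exit branch, so the completion ratio is the fraction of entries that reach the tail. The function name `completion_ratio` is hypothetical.

```python
def completion_ratio(head_count, off_trace_counts):
    """Fraction of trace entries that reach the trace tail.

    Assumption: head_count counts entries at the trace head and each
    element of off_trace_counts counts early exits through one
    side-exit branch, both sampled over the same interval.
    """
    if head_count <= 0:
        raise ValueError("head_count must be positive")
    exits = sum(off_trace_counts)
    if exits > head_count:
        raise ValueError("more early exits than trace entries")
    return 1.0 - exits / head_count
```

For instance, 1000 entries with two side exits taken 50 times each give a completion ratio of about 0.90, the same arithmetic as the 100% - 10% = 90% case on the trade-offs slide.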
