Comparison and Evaluation of Back Translation Algorithms for Static Single Assignment Form

Masataka Sassa #, Masaki Kohama + and Yo Ito #
# Dept. of Mathematical and Computing Sciences, Tokyo Institute of Technology
+ Fuji Photo Film Co., Ltd.

Background

SSA form (static single assignment form)
- A good representation for optimizing compilers
- Cannot be trivially translated into object code such as machine code

SSA back translation (translation back to normal form)
- Several algorithms exist, and their translation results differ
- There has been no research comparing and evaluating them
- The algorithm by Briggs et al., published earlier, is often adopted without much consideration

Outline

Comparison of SSA back translation algorithms
- Algorithm by Briggs et al.
- Algorithm by Sreedhar et al.
- A proposal for improving Briggs' algorithm

Comparison by experiments
- Varying the number of registers and the combination of optimizations
- The difference in execution time is not negligible (7%~10%)

Gives a criterion for selecting an SSA back translation algorithm

Contents

1. SSA form
2. SSA back translation
3. Two major algorithms for SSA back translation
4. Improvement of Briggs' algorithm
5. Experimental results
6. Conclusion

1 Static single assignment form (SSA form)

Static single assignment form (SSA form)

(a) Normal form:
    x = 1
    y = 2
    a = x + y
    a = a + 3
    b = x + y

(b) SSA form:
    x1 = 1
    y1 = 2
    a1 = x1 + y1
    a2 = a1 + 3
    b1 = x1 + y1

In SSA form there is only one definition for each variable, and for each use of a variable only one definition reaches it.

Optimization in static single assignment (SSA) form

(a) Normal form:
    a = x + y
    a = a + 3
    b = x + y

    → SSA translation

(b) SSA form:
    a1 = x0 + y0
    a2 = a1 + 3
    b1 = x0 + y0

    → Optimization in SSA form (common subexpression elimination)

(c) After SSA form optimization:
    a1 = x0 + y0
    a2 = a1 + 3
    b1 = a1

    → SSA back translation

(d) Optimized normal form:
    a1 = x0 + y0
    a2 = a1 + 3
    b1 = a1

SSA form is becoming increasingly popular in compilers, since it is suited for clear handling of dataflow analysis (definitions and uses) and optimization.
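
Why CSE is easy on SSA form: each SSA name has exactly one definition, so two instructions with the same operator and the same operand names must compute the same value. Below is a minimal, hedged sketch of this over one basic block; the tuple-based instruction format is an assumption for illustration, not the slides' or COINS code.

    # Hash-based local CSE over SSA names: identical (op, arg1, arg2)
    # keys are guaranteed to denote the same value in SSA form.
    def cse_ssa(instructions):
        """instructions: list of (dest, op, arg1, arg2) over SSA names."""
        available = {}                     # (op, arg1, arg2) -> earlier dest
        out = []
        for dest, op, a1, a2 in instructions:
            key = (op, a1, a2)
            if key in available:
                out.append((dest, 'copy', available[key], None))
            else:
                available[key] = dest
                out.append((dest, op, a1, a2))
        return out

    # The slide's example: the second "x0 + y0" becomes the copy b1 = a1
    code = [('a1', '+', 'x0', 'y0'),
            ('a2', '+', 'a1', '3'),
            ('b1', '+', 'x0', 'y0')]
    print(cse_ssa(code))
    # [('a1', '+', 'x0', 'y0'), ('a2', '+', 'a1', '3'), ('b1', 'copy', 'a1', None)]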

Translation into SSA form (SSA translation)

(a) Normal form:
    L1: x = 1
    L2: x = 2
    L3: … = x

(b) SSA form:
    L1: x1 = 1
    L2: x2 = 2
    L3: x3 = φ(x1:L1, x2:L2)
        … = x3

A φ-function is a hypothetical function that makes the definition point unique.
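
For concreteness, here is a minimal sketch of the standard φ-insertion step of SSA translation using iterated dominance frontiers [Cytron et al. 91], which the COINS module also uses (see the appendix). The data structures are assumptions for the example, not COINS code.

    # Place phi-functions at the iterated dominance frontier of each
    # variable's definition blocks (minimal SSA; pruned variants would
    # additionally consult liveness).
    def insert_phi_functions(blocks, defs, dominance_frontier):
        """blocks: block ids; defs: variable -> set of defining blocks;
           dominance_frontier: block -> set of blocks.
           Returns block -> set of variables needing a phi there."""
        phis = {b: set() for b in blocks}
        for var, def_blocks in defs.items():
            work = list(def_blocks)
            queued = set(work)
            while work:
                b = work.pop()
                for df in dominance_frontier[b]:
                    if var not in phis[df]:
                        phis[df].add(var)
                        if df not in queued:   # a phi is itself a new definition
                            queued.add(df)
                            work.append(df)
        return phis

    # The slide's example: x is assigned in L1 and L2, which join at L3
    df = {"L0": set(), "L1": {"L3"}, "L2": {"L3"}, "L3": set()}
    phis = insert_phi_functions(["L0", "L1", "L2", "L3"],
                                {"x": {"L1", "L2"}}, df)
    print(phis["L3"])   # {'x'} : place x3 = phi(x1:L1, x2:L2) at L3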

2 Back translation from SSA form into normal form (SSA back translation)

Back translation from SSA form into normal form (SSA back translation)

(a) SSA form:
    L1: x1 = 1
    L2: x2 = 2
    L3: x3 = φ(x1:L1, x2:L2)
        … = x3

(b) Normal form:
    L1: x1 = 1
        x3 = x1
    L2: x2 = 2
        x3 = x2
    L3: … = x3

Insert copy statements in the predecessor blocks of the φ-function and delete the φ. φ-functions must be deleted before code generation.
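
A sketch of this naïve rule: each "t = φ(a1:p1, ..., an:pn)" becomes one copy "t = ai" at the end of each predecessor pi. The tuple IR is an assumption for the example; blocks carry no explicit branch, so "append" models inserting the copy just before the block's branch, and critical edges are assumed to be already split (as in the COINS module).

    def naive_back_translate(blocks):
        pending = {b: [] for b in blocks}      # copies to add per block
        for bid, insns in blocks.items():
            kept = []
            for insn in insns:
                if insn[0] == 'phi':
                    _, target, args = insn
                    for src, pred in args:
                        pending[pred].append(('copy', target, src))
                else:
                    kept.append(insn)
            blocks[bid] = kept                 # the phi itself is deleted
        for bid, copies in pending.items():
            blocks[bid].extend(copies)
        return blocks

    # The slide's example: x3 = phi(x1:L1, x2:L2) in block L3
    cfg = {'L1': [('copy', 'x1', '1')],
           'L2': [('copy', 'x2', '2')],
           'L3': [('phi', 'x3', [('x1', 'L1'), ('x2', 'L2')])]}
    print(naive_back_translate(cfg))
    # L1 gains "x3 = x1", L2 gains "x3 = x2", and the phi is gone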

Problems of naïve SSA back translation

(i) Lost copy problem

Original SSA form:
    block1: x0 = 1
    block2: x1 = φ(x0, x2)
            y = x1
            x2 = 2
    block3: return y

After copy propagation:
    block1: x0 = 1
    block2: x1 = φ(x0, x2)
            x2 = 2
    block3: return x1

After back translation by the naïve method:
    block1: x0 = 1
            x1 = x0
    block2: x2 = 2
            x1 = x2
    block3: return x1        ← not correct!

The copy "x1 = x2" inserted for the loop back edge overwrites the value of x1 before block3 returns it.

Problems of naïve SSA back translation

(ii) Simple ordering problem: simultaneous assignments by φ-functions

Original SSA form:
    block1: x0 = 1
            y0 = 2
    block2: x1 = φ(x0, x2)
            y1 = φ(y0, y2)
            y2 = x1
            x2 = 3

After copy propagation:
    block1: x0 = 1
            y0 = 2
    block2: x1 = φ(x0, x2)
            y1 = φ(y0, x1)
            x2 = 3

After back translation by the naïve method:
    block1: x0 = 1
            y0 = 2
            x1 = x0
            y1 = y0
    block2: x2 = 3
            x1 = x2
            y1 = x1        ← not correct!

All φ-functions at the top of a block assign their targets simultaneously, but the naïvely inserted copies execute sequentially, so "y1 = x1" reads the new value of x1 instead of the old one.
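
The underlying requirement is that the φ-functions of a block form one parallel assignment. Below is a hedged sketch of the well-known way to sequentialize such a parallel copy, emitting safe copies first and breaking a genuine cycle (a swap) with a temporary; this illustrates the standard technique, not code from the paper.

    def sequentialize(parallel):
        """parallel: dict dest -> src, all assignments simultaneous."""
        parallel = dict(parallel)
        seq = []
        while parallel:
            # a copy is safe when no pending copy still reads its destination
            ready = [d for d in parallel
                     if d not in parallel.values() or parallel[d] == d]
            if ready:
                for d in ready:
                    if parallel[d] != d:          # drop self-copies
                        seq.append((d, parallel[d]))
                    del parallel[d]
            else:
                # only cycles remain: save one source in a temporary
                d = next(iter(parallel))
                seq.append(('tmp', parallel[d]))
                parallel[d] = 'tmp'
        return seq

    # The slide's simultaneous assignments x1 = x2 and y1 = x1:
    print(sequentialize({'x1': 'x2', 'y1': 'x1'}))
    # [('y1', 'x1'), ('x1', 'x2')] -- y1 reads the OLD x1, as required
    print(sequentialize({'x': 'y', 'y': 'x'}))    # a genuine swap
    # [('tmp', 'y'), ('y', 'x'), ('x', 'tmp')]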

3 Two major algorithms for SSA back translation

To remedy these problems, two SSA back translation algorithms have been proposed:
(i) Briggs et al. [1998]: insert copy statements
(ii) Sreedhar et al. [1999]: eliminate interference

(i) SSA back translation algorithm by Briggs (lost copy problem)

(a) SSA form:
    block1: x0 = 1
    block2: x1 = φ(x0, x2)
            x2 = 2
    block3: return x1

(b) Normal form after back translation:
    block1: x0 = 1
            x1 = x0
    block2: temp = x1
            x2 = 2
            x1 = x2
    block3: return temp

Many copies are inserted. Is that OK? (Note the live ranges of x1 and temp.)

(i) SSA back translation algorithm by Briggs (cont.)

(c) Normal form after back translation (same as (b)):
    block1: x0 = 1
            x1 = x0
    block2: temp = x1
            x2 = 2
            x1 = x2
    block3: return temp

The live ranges of x0, x1 and x2 do not interfere, so coalesce {x0, x1, x2} → x and delete the relevant copies.

(d) After coalescing:
    block1: x = 1
    block2: temp = x
            x = 2
    block3: return temp

(i) SSA back translation algorithm by Briggs (cont.)

Possible problems:
- Many copies are inserted. The authors claim that most copies can be coalesced, but in practice many copies interfere and thus cannot be coalesced, which increases register pressure (the demand for registers).
- Processing φ-functions within loops causes copies that cannot be coalesced. We propose an improvement for this case.
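
To make the mechanism of slide (b) concrete, here is a much-simplified, hedged sketch of Briggs-style φ removal for the lost copy case. It is illustrative only: instructions are plain tuples, live_out is supplied by hand, and later_blocks lists the blocks whose uses of the φ target must be redirected to the temporary, whereas the real algorithm derives both from liveness and dominance.

    def briggs_remove_phi(blocks, phi_block, later_blocks, live_out, temp):
        phi = blocks[phi_block][0]            # assume one phi, at the head
        _, target, args = phi
        rest = blocks[phi_block][1:]
        if target in live_out[phi_block]:
            # lost-copy case: the back-edge copy appended below would
            # overwrite target, so save its value in a temporary and
            # make the later uses read the temporary instead
            rest.insert(0, ('copy', temp, target))
            for b in later_blocks:
                blocks[b] = [('use', temp) if i == ('use', target) else i
                             for i in blocks[b]]
        blocks[phi_block] = rest
        for src, pred in args:                # naive copies, as before
            blocks[pred].append(('copy', target, src))

    # The lost copy example: block2 loops on itself, block3 returns x1
    cfg = {'block1': [('copy', 'x0', '1')],
           'block2': [('phi', 'x1', [('x0', 'block1'), ('x2', 'block2')]),
                      ('copy', 'x2', '2')],
           'block3': [('use', 'x1')]}         # "return x1" on the slide
    briggs_remove_phi(cfg, 'block2', ['block3'], {'block2': {'x1'}}, 'temp')
    for b in ('block1', 'block2', 'block3'):
        print(b, cfg[b])
    # block2 now starts with "temp = x1" and block3 uses temp, so the
    # value to be returned survives the appended copy "x1 = x2"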

(ii) SSA back translation algorithm by Sreedhar - principle

(a) SSA form:
    L1: x1 = 1
    L2: x2 = 2
    L3: x3 = φ(x1:L1, x2:L2)
        … = x3

(b) Normal form ({x3, x1, x2} ⇒ X):
    L1: X = 1
    L2: X = 2
    L3: … = X

If the live ranges of the variables in a φ-function (x3, x1, x2) do not interfere, coalesce them into a single variable (X) and delete the φ-function.

(ii) SSA back translation algorithm by Sreedhar - rewriting

(a) SSA form:
    x1 = φ(x2, ...)
    ...
    x2 = ...

(b) Rewrite the target:
    x1' = φ(x2, ...)
    x1 = x1'
    ...
    x2 = ...

(c) Rewrite the parameter:
    x1 = φ(x2', ...)
    ...
    x2 = ...
    x2' = x2

If the variables in a φ-function interfere, rewrite the target or the parameter.

(ii) SSA back translation algorithm by Sreedhar (lost copy problem)

(a) SSA form (the variables in the φ-function interfere: x1 and x2):
    block1: x0 = 1
    block2: x1 = φ(x0, x2)
            x2 = 2
    block3: return x1

(b) Eliminating interference by rewriting the target (x0, x1' and x2 do not interfere):
    block1: x0 = 1
    block2: x1' = φ(x0, x2)
            x1 = x1'
            x2 = 2
    block3: return x1

(c) Normal form after back translation (rename {x1', x0, x2} → X and delete the φ):
    block1: X = 1
    block2: x1 = X
            X = 2
    block3: return x1

(ii) SSA back translation algorithm by Sreedhar (cont.)

Benefit:
- Few copies are inserted

Possible problems:
- Live ranges of variables become long
- This may increase register pressure
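
Below is a hedged sketch of the decision at the heart of Sreedhar's method: coalesce the φ resources when their live ranges do not interfere, otherwise first rewrite the target through a fresh name (the "x1' = φ(...); x1 = x1'" rewriting of the earlier slide). The interference relation is assumed to come from a separate liveness analysis, and the real algorithm also re-checks interference among the rewritten resources; this simplification is for illustration only.

    def eliminate_phi(target, sources, interference):
        """interference: set of frozenset pairs of interfering variables."""
        def clash(a, b):
            return frozenset((a, b)) in interference

        resources = [target] + list(sources)
        copies = []
        if any(clash(a, b) for a in resources for b in resources if a != b):
            fresh = target + "'"             # rewrite the target
            copies.append((target, fresh))   # copy "target = fresh" after the phi
            resources[0] = fresh             # fresh interferes with nothing here
        merged = "X"                         # representative of the class
        return merged, resources, copies

    # Lost copy example: x1 and x2 interfere (x1 is live across "x2 = 2"),
    # so the target is rewritten and {x1', x0, x2} coalesce into X
    print(eliminate_phi("x1", ["x0", "x2"], {frozenset(("x1", "x2"))}))
    # -> ('X', ["x1'", 'x0', 'x2'], [('x1', "x1'")])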

4 Problems of Briggs’ algorithm and proposal for improvement

Problems of Briggs' algorithm - long live ranges

(a) SSA form:
    x1 = 0
    loop: x3 = φ(x1, x2)
          x2 = 1
          … = x3 + 1
    return x3

(b) Briggs' back translation:
    x1 = 0
    x3 = x1
    loop: temp = x3
          x2 = 1
          … = x3 + 1
          x3 = x2
    return temp

The φ-function is in a loop, so copies are inserted; the copy "x3 = x2" falls inside the live range of x3, so the temporary temp is needed; the live ranges of x2, x3 and temp interfere, so the copies cannot be coalesced.

Improvement of Briggs' algorithm (1)

First rewrite the φ-function à la Sreedhar's method, which removes the interference between x2 and x3. After back translation, x2 and x3' do not interfere and can be coalesced, giving a better result than Briggs' algorithm.

(c) Rewrite the φ:
    x1 = 0
    loop: x3' = φ(x1, x2)
          x3 = x3'
          x2 = 1
          … = x3 + 1
    return x3

(d) After back translation (x2 and x3' can be coalesced):
    x1 = 0
    x3' = x1
    loop: x3 = x3'
          x2 = 1
          … = x3 + 1
          x3' = x2
    return x3

Improvement of Briggs' algorithm (2) - conditional branch

(a) SSA form:
    y = x1 + 1
    if (x1 > 10) …
    x3 = φ(x1, x2)
    return x3

(b) Back translation by Briggs (x1 and x3 interfere):
    y = x1 + 1
    x3 = x1
    if (x3 > 10) …
    return x3

(c) Improvement (x1 and x3 do not interfere and can be coalesced):
    y = x1 + 1
    x3 = x1
    if (x1 > 10) …
    return x3

Implementation

COINS compiler infrastructure
- Adopts Sreedhar's SSA back translation algorithm
- Iterated coalescing [George & Appel 1996]
- Implemented in Java

SSA optimization module of COINS
- Briggs' algorithm and the proposed improvement are additionally implemented for measurement
- Implemented in Java

5 Experimental results using SPEC benchmarks

Considerations up to now

Briggs' algorithm: many interfering copies
- With enough registers → disadvantageous
- With a small number of registers → register pressure increases due to the interfering copies

Sreedhar's algorithm: few interfering copies, but long live ranges of variables
- With enough registers → advantageous
- With a small number of registers → register pressure increases due to the long live ranges

Experiments

Environment: Sun Blade 1000
    Architecture: SPARC V9
    Processor:    UltraSPARC-III 750MHz x 2
    L1 cache:     64KB (data), 32KB (instruction)
    L2 cache:     8MB
    Memory:       1GB
    OS:           SunOS 5.8

Benchmarks (C language): SPECint2000 mcf, gzip (simply called gzip)

Combinations of optimizations:
    Optimization 1: copy propagation
    Optimization 2: copy propagation, dead code elimination, common subexpression elimination
    Optimization 3: copy propagation, loop-invariant code motion

Number of registers: 8 (e.g. x86) and 20 (e.g. RISC machines)

Experiments - viewpoints of evaluation

Static count:
- No. of copies that cannot be coalesced
- No. of variables spilled in register allocation

Dynamic count:
- No. of executed copies
- No. of executed load and store instructions
- Execution time

Measurement flow: source program in C → SSA translation → optimization → SSA back translation → register allocation → object program in SPARC

No. of copies that cannot be coalesced (static count; relative value: Briggs = 1, smaller is better)

No. of variables spilled (static count; absolute values, smaller is better)

No. of executed copies (dynamic count; relative value: Briggs = 1, smaller is better)

No. of executed load/store instructions (dynamic count; relative value: Briggs = 1, smaller is better)

Execution time (relative value: Briggs = 1, smaller is better)

Additional experiments

More benchmarks are run with more general combinations of optimizations for Briggs' and Sreedhar's algorithms.

Benchmarks: SPECint2000 gzip, vpr, mcf, parser, bzip2, twolf

Combinations of optimizations:
- No optimization
- Optimization 1: copy propagation, constant propagation, dead code elimination
- Optimization 2: loop-invariant code motion, constant propagation, common subexpression elimination, copy propagation, dead code elimination

Execution time is measured.

Execution time (relative value: Briggs = 1, smaller is better)

Summary of experiments

Briggs' algorithm produced more copies that cannot be coalesced than we expected.

Case of 20 registers:
- Sreedhar's algorithm is advantageous due to the lower dynamic count of copies
- It gives better execution time, by around 1%~10%

Case of 8 registers:
- Sreedhar's algorithm is advantageous due to both the lower dynamic count of copies and the lower dynamic count of load/store instructions
- It gives better execution time, by around 1%~5%

The effect of the SSA back translation algorithm does not depend much on the combination of optimizations.

Our proposed improvement has some effect over Briggs' algorithm, but does not surpass Sreedhar's.

Conclusion and future work

Our contributions:
- Showed that the difference between SSA back translation algorithms influences execution time by up to 10%
- Experiments in a variety of situations
- Clarified the advantages and disadvantages of the different algorithms
- Proposed an improvement of Briggs' algorithm
- The experiments show that Sreedhar's algorithm is superior (by up to 10%)

Future work:
- Further experiments measuring several facets using more benchmarks
- Considering the coalescing algorithm in register allocation

Conclusion

The experimental results show that selecting a good back translation algorithm is quite important, since it can reduce the execution time of the object code by up to 10%, which is comparable to applying a middle-level global optimization.

Appendix: SSA optimization module in COINS

COINS compiler infrastructure (overview diagram)

Frontends for C, Fortran, OpenMP C and new languages feed the High-level Intermediate Representation (HIR). A basic analyzer & optimizer and an advanced optimizer with a basic parallelizer (SIMD parallelization) work on HIR, which is then lowered (HIR to LIR) to the Low-level Intermediate Representation (LIR). The SSA optimizer runs on LIR, and the code generator targets SPARC, x86, new machines, or C generation.

SSA optimization module in COINS (about 12,000 lines; diagram)

Source program → Low-level Intermediate Representation (LIR) → LIR to SSA translation (3 variations) → LIR in SSA form → SSA optimization (common subexpression elimination, GVN by query propagation, copy propagation, conditional constant propagation, and much more; transformations on SSA such as edge splitting, redundant φ elimination and empty block elimination) → optimized LIR in SSA form → SSA to LIR back translation (2 methods, one with 3 variations, plus 2 coalescing methods) → code generation → object code.

Outline of SSA module in COINS (1)

Translation into and back from SSA form on the Low-level Intermediate Representation (LIR)
◆ SSA translation
  ◆ Uses dominance frontiers [Cytron et al. 91]
  ◆ 3 variations: translation into Minimal, Semi-pruned and Pruned SSA forms
◆ SSA back translation
  ◆ Sreedhar et al.'s method [Sreedhar et al. 99]: 3 variations (Method I, II and III)
  ◆ Briggs et al.'s method [Briggs et al. 98] (under development)
◆ Coalescing
  ◆ SSA-based coalescing during SSA back translation [Sreedhar et al. 99]
  ◆ Chaitin-style coalescing after back translation
◆ Each variation and coalescing can be specified by options

Outline of SSA module in COINS (2)

◆ Several optimizations on SSA form:
  ◆ dead code elimination, copy propagation, common subexpression elimination, global value numbering based on efficient query propagation, conditional constant propagation, loop-invariant code motion, operator strength reduction for induction variables and linear function test replacement, empty block elimination, copy folding at SSA translation time, ...
◆ Useful transformations as an infrastructure for SSA form optimization:
  ◆ critical edge removal on the control flow graph, loop transformation from 'while' loop to 'if-do-while' loop, redundant phi-function elimination, construction of the SSA graph, ...
◆ Each variation, optimization and transformation can be applied selectively by specifying options

References

COINS: english/index.html
SSA module in COINS: english/index.html

Relative dynamic count of load and store instructions (number of registers = 8; benchmarks: gzip, mcf)

Relative dynamic count of copies (number of registers = 8; benchmarks: gzip, mcf)

Relative execution time (number of registers = 8; benchmarks: gzip, mcf)

Relative execution time (number of registers = 20; benchmarks: gzip, mcf)