EECS 583 – Class 8 Finish SSA, Start on Classic Optimization


EECS 583 – Class 8 Finish SSA, Start on Classic Optimization University of Michigan October 1, 2012

Announcements & Reading Material
- Homework 2 preannouncement: extend the LLVM LICM optimization to perform speculative LICM. This homework is significantly harder than HW 1; it will be posted and discussed in class on Wednesday. Best time on the performance benchmarks wins a prize. Start looking at the LLVM code for LICM.
- Today's class: Compilers: Principles, Techniques, and Tools, A. Aho, R. Sethi, and J. Ullman, Addison-Wesley, 1988, sections 9.9, 10.2, 10.3, 10.7
- Material for Wednesday: "Compiler Code Transformations for Superscalar-Based High-Performance Systems," Scott Mahlke, William Chen, John Gyllenhaal, Wen-mei Hwu, Pohua Chang, and Tokuzo Kiyohara, Proceedings of Supercomputing '92, Nov. 1992, pp. 808-817

From Last Time: Static Single Assignment (SSA) Form
- Each assignment to a variable is given a unique name
- All of the uses reached by that assignment are renamed
- DU chains become obvious based on the register name!
Original code:
  if ( ... )
    x = -1
  else
    x = 5
  y = x
SSA form:
  if ( ... )
    x0 = -1
  else
    x1 = 5
  x2 = Phi(x0, x1)
  y = x2

From Last Time: Computing Dominance Frontiers
[CFG and dominator tree figures for BB0-BB7]
Dominance frontiers for the example:
  BB   DF
  0    -
  1    -
  2    7
  3    7
  4    6
  5    6
  6    7
  7    1
Algorithm:
  For each join point X in the CFG
    For each predecessor, Y, of X in the CFG
      Run up to IDOM(X) in the dominator tree, adding X to DF(N) for each N between Y and IDOM(X) (or X, whichever is encountered first)
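
A minimal Python sketch of this dominance-frontier pass follows; it assumes each block carries hypothetical preds (CFG predecessors) and idom (immediate dominator, None at the entry) fields, and is illustrative rather than the course's reference code.

```python
def compute_dominance_frontiers(blocks):
    """For each join point x, walk from each predecessor up the dominator
    tree until idom(x), adding x to the DF of every node passed."""
    df = {b: set() for b in blocks}
    for x in blocks:
        if len(x.preds) < 2:          # only join points contribute to DFs
            continue
        for y in x.preds:
            runner = y
            while runner is not x.idom:
                df[runner].add(x)     # x is in DF(runner)
                runner = runner.idom  # keep climbing the dominator tree
    return df
```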

Improved Class Problem: Compute the dominance frontiers
[CFG and dominator tree figures for BB0-BB5]
Block contents:
  BB0: a =, b =, c =
  BB1: (empty)
  BB2: c = b + a
  BB3: b = a + 1; a = b * c
  BB4: b = c - a
  BB5: a = a - c; c = b * c
Algorithm reminder:
  For each join point X in the CFG
    For each predecessor, Y, of X in the CFG
      Run up to IDOM(X) in the dominator tree, adding X to DF(N) for each N between Y and IDOM(X) (or X, whichever is encountered first)

SSA Step 1 - Phi Node Insertion
- Compute dominance frontiers
- Find global names (aka virtual registers)
  - A name is global if it is live on entry to some block
  - For each name, build a list of blocks that define it
- Insert Phi nodes (see the sketch below):
  For each global name n
    For each BB b in which n is defined
      For each BB d in b's dominance frontier
        Insert a Phi node for n in d
        Add d to n's list of defining BBs
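
As a rough illustration, here is the worklist form of this step in Python; df comes from the dominance-frontier pass, defs[n] is the set of blocks defining global name n, and insert_phi is a hypothetical helper.

```python
def insert_phi_nodes(global_names, defs, df):
    for n in global_names:
        worklist = list(defs[n])        # blocks that define n
        has_phi = set()
        while worklist:
            b = worklist.pop()
            for d in df[b]:             # every block in b's dominance frontier
                if d not in has_phi:
                    d.insert_phi(n)     # place "n = Phi(n, ..., n)" at top of d
                    has_phi.add(d)
                    worklist.append(d)  # the Phi is itself a new def of n
```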

Phi Node Insertion - Example
Dominance frontiers:
  BB   DF
  0    -
  1    -
  2    7
  3    7
  4    6
  5    6
  6    7
  7    1
Where Phi nodes are needed:
- a is defined in 0, 1, 3 -> need Phi in 7; then a is defined in 7 -> need Phi in 1
- b is defined in 0, 2, 6 -> need Phi in 7; then b is defined in 7 -> need Phi in 1
- c is defined in 0, 1, 2, 5 -> need Phi in 6, 7; then c is defined in 7 -> need Phi in 1
- d is defined in 2, 3, 4 -> need Phi in 6, 7; then d is defined in 7 -> need Phi in 1
- i is defined in BB7 -> need Phi in BB1
[CFG figure: BB0 defines a, b, c, i; BB1 gets Phi nodes for a, b, c, d, i and defines a, c; BB2 defines b, c, d; BB3 defines a, d; BB4 defines d; BB5 defines c; BB6 gets Phi nodes for c, d and defines b; BB7 gets Phi nodes for a, b, c, d and defines i]

Improved Class Problem: Insert the Phi nodes
Dominance frontiers:
  BB   DF
  0    -
  1    -
  2    4
  3    4, 5
  4    5
  5    1
[Dominator tree and CFG figures for BB0-BB5]
Block contents as shown in the figure (Phi nodes appear at BB1, BB4, and BB5):
  BB0: a =, b =, c =
  BB1: a = Phi(a,a); b = Phi(b,b); c = Phi(c,c)
  BB2: c = b + a
  BB3: b = a + 1; a = b * c
  BB4: a = Phi(a,a); b = Phi(b,b); c = Phi(c,c); b = c - a
  BB5: a = Phi(a,a); b = Phi(b,b); c = Phi(c,c); a = a - c; c = b * c

SSA Step 2 – Renaming Variables
- Use an array of stacks, one stack per global variable (VR)
- Algorithm sketch (see the code below):
  For each BB b in a preorder traversal of the dominator tree
    Generate unique names for each Phi node
    Rewrite each operation in the BB
      Uses of a global name: current name from the stack
      Defs of a global name: create and push a new name
    Fill in Phi node parameters of successor blocks
    Recurse on b's children in the dominator tree
    <on exit from b> pop names generated in b from the stacks
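
A compact Python sketch of the renaming pass, assuming each block exposes phis, ops, succs, and dominator-tree children, and that operand fields can be rewritten in place (all hypothetical names):

```python
def rename_block(b, stack, counter):
    pushed = []                                   # names created in this block
    def new_name(v):
        name = f"{v}{counter[v]}"                 # e.g. a0, a1, a2, ...
        counter[v] += 1
        stack[v].append(name)
        pushed.append(v)
        return name

    for phi in b.phis:                            # unique names for Phi results
        phi.dest = new_name(phi.var)
    for op in b.ops:
        # uses of global names get the current (top-of-stack) name
        op.srcs = [stack[s][-1] if s in stack else s for s in op.srcs]
        if op.dest in stack:                      # defs push a new name
            op.dest = new_name(op.dest)
    for succ in b.succs:                          # fill in Phi parameters
        for phi in succ.phis:
            phi.set_operand(from_block=b, name=stack[phi.var][-1])
    for child in b.dom_children:                  # preorder over dominator tree
        rename_block(child, stack, counter)
    for v in pushed:                              # on exit: pop names made here
        stack[v].pop()
```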

Renaming – Example (Initial State)
  var: a  b  c  d  i
  ctr: 0  0  0  0  0
  stk: a: a0   b: b0   c: c0   d: d0   i: i0
[CFG figure: nothing renamed yet; BB0 defines a, b, c, i, and BB1 holds the Phi nodes for a, b, c, d, i]

Renaming – Example (After BB0)
  var: a  b  c  d  i
  ctr: 1  1  1  1  1
  stk: a: a0   b: b0   c: c0   d: d0   i: i0
[CFG figure: BB0's definitions now read a0 =, b0 =, c0 =, i0 =; BB1's Phi nodes pick up these names as their first operands: a = Phi(a0,a), b = Phi(b0,b), c = Phi(c0,c), d = Phi(d0,d), i = Phi(i0,i)]

Renaming – Example (After BB1)
  var: a  b  c  d  i
  ctr: 3  2  3  2  2
  stk: a: a0 a1 a2   b: b0 b1   c: c0 c1 c2   d: d0 d1   i: i0 i1
[CFG figure: BB1's Phi destinations become a1, b1, c1, d1, i1; BB1's own definitions become a2 =, c2 =]

Renaming – Example (After BB2)
  var: a  b  c  d  i
  ctr: 3  3  4  3  2
  stk: a: a0 a1 a2   b: b0 b1 b2   c: c0 c1 c2 c3   d: d0 d1 d2   i: i0 i1
[CFG figure: BB2's definitions become b2 =, c3 =, d2 =; BB7's Phi operands from this path become a = Phi(a2,a), b = Phi(b2,b), c = Phi(c3,c), d = Phi(d2,d)]

Renaming – Example (Before BB3)
This just pops the stacks to remove the names created on the left path out of BB1 (BB2's names b2, c3, d2).
  var: a  b  c  d  i
  ctr: 3  3  4  3  2
  stk: a: a0 a1 a2   b: b0 b1   c: c0 c1 c2   d: d0 d1   i: i0 i1

Renaming – Example (After BB3)
  var: a  b  c  d  i
  ctr: 4  3  4  4  2
  stk: a: a0 a1 a2 a3   b: b0 b1   c: c0 c1 c2   d: d0 d1 d3   i: i0 i1
[CFG figure: BB3's definitions become a3 =, d3 =]

Renaming – Example (After BB4)
  var: a  b  c  d  i
  ctr: 4  3  4  5  2
  stk: a: a0 a1 a2 a3   b: b0 b1   c: c0 c1 c2   d: d0 d1 d3 d4   i: i0 i1
[CFG figure: BB4's definition becomes d4 =; BB6's Phi operands from this path become c = Phi(c2,c), d = Phi(d4,d)]

Renaming – Example (After BB5)
  var: a  b  c  d  i
  ctr: 4  3  5  5  2
  stk: a: a0 a1 a2 a3   b: b0 b1   c: c0 c1 c2 c4   d: d0 d1 d3   i: i0 i1
[CFG figure: d4 has been popped on exit from BB4; BB5's definition becomes c4 =; BB6's Phi nodes now read c = Phi(c2,c4), d = Phi(d4,d3)]

Renaming – Example (After BB6)
  var: a  b  c  d  i
  ctr: 4  4  6  6  2
  stk: a: a0 a1 a2 a3   b: b0 b1 b3   c: c0 c1 c2 c5   d: d0 d1 d3 d5   i: i0 i1
[CFG figure: BB6's Phi destinations become c5 = Phi(c2,c4), d5 = Phi(d4,d3); BB6's definition becomes b3 =; BB7's Phi operands from this path become a = Phi(a2,a3), b = Phi(b2,b3), c = Phi(c3,c5), d = Phi(d2,d5)]

Renaming – Example (After BB7)
  var: a  b  c  d  i
  ctr: 5  5  7  7  3
  stk: a: a0 a1 a2 a4   b: b0 b1 b4   c: c0 c1 c2 c6   d: d0 d1 d6   i: i0 i1 i2
[CFG figure: BB7's Phi destinations become a4 = Phi(a2,a3), b4 = Phi(b2,b3), c6 = Phi(c3,c5), d6 = Phi(d2,d5); BB7's definition becomes i2 =; BB1's Phi nodes now read a1 = Phi(a0,a4), b1 = Phi(b0,b4), c1 = Phi(c0,c6), d1 = Phi(d0,d6), i1 = Phi(i0,i2)]
Fin!

Improved Class Problem: Rename the variables
Dominance frontiers:
  BB   DF
  0    -
  1    -
  2    4
  3    4, 5
  4    5
  5    1
[Dominator tree and CFG figures for BB0-BB5]
Block contents as shown in the figure:
  BB0: a =, b =, c =
  BB1: a = Phi(a,a); b = Phi(b,b); c = Phi(c,c)
  BB2: c = b + a
  BB3: b = a + 1; a = b * c
  BB4: a = Phi(a,a); b = Phi(b,b); c = Phi(c,c); b = c - a
  BB5: a = Phi(a,a); b = Phi(b,b); c = Phi(c,c); a = a - c; c = b * c

Code Optimization
- Make the code run faster on the target processor
  - My (Scott's) favorite topic!!
  - Other objectives: power, code size
- Classes of optimization
  1. Classical (machine independent)
     - Reducing operation count (redundancy elimination)
     - Simplifying operations
     - Generally good for any kind of machine
  2. Machine specific
     - Peephole optimizations
     - Take advantage of specialized hardware features
  3. Parallelism enhancing
     - Increasing parallelism (ILP or TLP)
     - Possibly increases the instruction count

A Tour Through the Classical Optimizations
- For this class, go over the concepts of a small subset of the optimizations
  - What it is, why it's useful
  - When it can be applied (the set of conditions that must be satisfied)
  - How it works
  - Give you the flavor, but don't want to beat you over the head
- Challenges
  - Register pressure?
  - Parallelism versus operation count

Dead Code Elimination
- Remove any operation whose result is never consumed
- Rules: X can be deleted if
  - X is not a store or a branch, and
  - X's DU chain is empty, or its dest register is not live
- This misses some dead code!! Especially in loops
  - Critical operation: a store or branch operation
  - Any operation that does not directly or indirectly feed a critical operation is dead
  - Trace UD chains backwards from critical operations
  - Any op not visited is dead (see the sketch below)
Example:
  r1 = 3
  r2 = 10
  r4 = r4 + 1
  r7 = r1 * r4
  r2 = 0
  r3 = r3 + 1
  r3 = r2 + r1
  store (r1, r3)
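
A sketch of this mark-and-sweep formulation in Python; ud_chain is a hypothetical helper returning the definitions that reach a given operand use.

```python
def find_dead_ops(ops, ud_chain):
    marked = set()
    # critical operations seed the worklist
    worklist = [op for op in ops if op.is_store or op.is_branch]
    while worklist:
        op = worklist.pop()
        if op in marked:
            continue
        marked.add(op)                 # op feeds a critical operation
        for src in op.srcs:
            for d in ud_chain(op, src):        # walk UD chains backwards
                if d not in marked:
                    worklist.append(d)
    return [op for op in ops if op not in marked]   # unmarked ops are dead
```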

Constant Propagation
- Forward propagation of moves of the form rx = L (where L is a literal)
  - Maximally propagate
  - Assume no instruction encoding restrictions
- When is it legal?
  - SRC: the literal is a hard-coded constant, so never a problem
  - DEST: must be available
    - Guaranteed to reach
    - "May reach" is not good enough
Example:
  r1 = 5
  r2 = r1 + r3
  r1 = r1 + r2
  r7 = r1 + r4
  r8 = r1 + 3
  r9 = r1 + r11

Local Constant Propagation
- Consider 2 ops, X and Y, in the same BB, with X before Y
  1. X is a move
  2. src1(X) is a literal
  3. Y consumes dest(X)
  4. There is no definition of dest(X) between X and Y
  5. There is no danger between X and Y
     - When dest(X) is a macro reg, a BRL destroys the value
- Note: ignore operation format issues, so all operations can have literals in either operand position (a sketch of the scan follows)
Example:
  r1 = 5
  r2 = '_x'
  r3 = 7
  r4 = r4 + r1
  r1 = r1 + r2
  r1 = r1 + 1
  r3 = 12
  r8 = r1 - r2
  r9 = r3 + r5
  r3 = r2 + 1
  r10 = r3 - r1
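
A single-basic-block sketch of these conditions in Python; is_move, src_is_literal, and is_call (standing in for the BRL "danger" case) are assumed accessors on a hypothetical op class.

```python
def local_constant_prop(bb):
    known = {}                                     # reg -> literal it holds
    for op in bb.ops:
        # conditions 3 and 4: rewrite uses whose literal move still holds
        op.srcs = [known.get(s, s) for s in op.srcs]
        if op.is_call:
            known.clear()                          # condition 5: conservatively
                                                   # forget facts at a BRL/call
        if op.dest is not None:
            if op.is_move and op.src_is_literal():
                known[op.dest] = op.src1           # conditions 1 and 2
            else:
                known.pop(op.dest, None)           # any other def kills the fact
```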

Global Constant Propagation
- Consider 2 ops, X and Y, in different BBs
  1. X is a move
  2. src1(X) is a literal
  3. Y consumes dest(X)
  4. X is in a_in(BB(Y))
  5. dest(X) is not modified between the top of BB(Y) and Y
  6. There is no danger between X and Y
     - When dest(X) is a macro reg, a BRL destroys the value
Example (the operations are spread across several BBs in the figure):
  r1 = 5
  r2 = '_x'
  r1 = r1 + r2
  r7 = r1 - r2
  r8 = r1 * r2
  r9 = r1 + r2
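
The cross-block version mainly swaps the local scan for an availability check; a rough Python sketch, where a_in[bb] is assumed to come from an available-definitions dataflow pass and the op accessors are hypothetical as before:

```python
def global_constant_prop(bbs, a_in):
    for bb in bbs:
        # condition 4: literal moves available on entry to this block
        known = {x.dest: x.src1 for x in a_in[bb]
                 if x.is_move and x.src_is_literal()}
        for op in bb.ops:
            op.srcs = [known.get(s, s) for s in op.srcs]   # condition 3
            if op.is_call:
                known.clear()                              # condition 6: danger
            if op.dest is not None:
                known.pop(op.dest, None)                   # condition 5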

Constant Folding
- Simplify one operation based on the values of its source operands
  - Constant propagation creates opportunities for this
- All constant operands
  - Evaluate the op, replace with a move
    - r1 = 3 * 4  ->  r1 = 12
    - r1 = 3 / 0  ->  ??? Don't evaluate excepting ops! (and what about floating point?)
  - Evaluate conditional branches, replace with BRU or noop
    - if (1 < 2) goto BB2  ->  BRU BB2
    - if (1 > 2) goto BB2  ->  convert to a noop
- Algebraic identities (see the sketch below)
  - r1 = r2 + 0, r2 - 0, r2 | 0, r2 ^ 0, r2 << 0, r2 >> 0  ->  r1 = r2
  - r1 = 0 * r2, 0 / r2, 0 & r2  ->  r1 = 0
  - r1 = r2 * 1, r2 / 1  ->  r1 = r2
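
A toy Python sketch of a subset of these folding rules, using an operand model where literals are plain ints and registers are strings (the model and opcode names are just for illustration):

```python
def fold(opcode, a, b):
    lit = lambda v: isinstance(v, int)
    if lit(a) and lit(b):                        # all-constant operands
        if opcode == 'div' and b == 0:
            return None                          # never evaluate excepting ops
        ops = {'add': lambda: a + b, 'sub': lambda: a - b,
               'mul': lambda: a * b, 'div': lambda: a // b}
        if opcode in ops:
            return ('move', ops[opcode]())       # r1 = 3 * 4  ->  r1 = 12
    # algebraic identities
    if b == 0 and opcode in ('add', 'sub', 'or', 'xor', 'shl', 'shr'):
        return ('move', a)                       # r2 + 0, r2 >> 0, ...  ->  r2
    if b == 1 and opcode in ('mul', 'div'):
        return ('move', a)                       # r2 * 1, r2 / 1  ->  r2
    if a == 0 and opcode in ('mul', 'and'):
        return ('move', 0)                       # 0 * r2, 0 & r2  ->  0
    return None                                  # nothing to simplify
```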

Class Problem
Optimize this applying 1. constant propagation, 2. constant folding
  r4 = 1
  r7 = r1 * 4
  r6 = 8
  if (r3 > 0)
  r2 = 0
  r6 = r6 * r7
  r3 = r2 / r6
  r3 = r4
  r3 = r3 + r2
  r1 = r6
  r2 = r2 + 1
  r1 = r1 + 1
  if (r1 < 100)
  store (r1, r3)

Forward Copy Propagation
- Forward propagation of the RHS of moves
  r1 = r2
  ...
  r4 = r1 + 1  ->  r4 = r2 + 1
- Benefits
  - Reduce chain of dependences
  - Eliminate the move
- Rules (ops X and Y)
  - X is a move
  - src1(X) is a register
  - Y consumes dest(X)
  - X.dest is an available def at Y
  - X.src1 is an available expr at Y
Example:
  r1 = r2
  r3 = r4
  r2 = 0
  r6 = r3 + 1
  r5 = r2 + r3
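
A basic-block-local Python sketch of these rules (the full conditions use available definitions/expressions across blocks; here both are approximated by scanning within one block, and the op accessors are hypothetical):

```python
def local_copy_prop(bb):
    copies = {}                                   # dest reg -> source reg
    for op in bb.ops:
        op.srcs = [copies.get(s, s) for s in op.srcs]   # rewrite uses
        if op.dest is not None:
            # a def of r kills any copy whose dest or source is r
            copies = {d: s for d, s in copies.items()
                      if d != op.dest and s != op.dest}
            if op.is_move and op.src_is_register():
                copies[op.dest] = op.src1         # record "dest = src" move
```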

CSE – Common Subexpression Elimination
- Eliminate recomputation of an expression by reusing the previous result
  r1 = r2 * r3  ->  insert r100 = r1
  ...
  r4 = r2 * r3  ->  r4 = r100
- Benefits
  - Reduce work
  - Moves can get copy propagated
- Rules (ops X and Y)
  - X and Y have the same opcode
  - src(X) = src(Y), for all srcs
  - expr(X) is available at Y
  - If X is a load, then there is no store that may write to address(X) along any path between X and Y
- If the op is a load, call it redundant load elimination rather than CSE
Example:
  r1 = r2 * r6
  r3 = r4 / r7
  r2 = r2 + 1
  r6 = r3 * 7
  r5 = r2 * r6
  r8 = r4 / r7
  r9 = r3 * 7
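
A local, value-numbering-style sketch of CSE in Python; replace_with_move and the opcode/operand fields are hypothetical, and the store handling is deliberately conservative (a real pass would use the may-alias check from the rules above).

```python
def local_cse(bb):
    available = {}                                # (opcode, srcs) -> result reg
    for op in bb.ops:
        key = (op.opcode, tuple(op.srcs))
        redundant = key in available
        if redundant:
            op.replace_with_move(available[key])  # r4 = r2 * r3  ->  r4 = r1
        if op.is_store:
            # a store may write a load's address; without alias info,
            # conservatively kill all available loads
            available = {k: v for k, v in available.items() if k[0] != 'load'}
        if op.dest is not None:
            # a redefinition kills expressions that used or produced dest
            available = {k: v for k, v in available.items()
                         if v != op.dest and op.dest not in k[1]}
            if not redundant and op.dest not in key[1]:
                available[key] = op.dest
```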

Class Problem
Optimize this applying 1. dead code elimination, 2. forward copy propagation, 3. CSE
  r4 = r1
  r6 = r15
  r2 = r3 * r4
  r8 = r2 + r5
  r9 = r
  r7 = load(r2)
  if (r2 > r8)
  r5 = r9 * r4
  r11 = r2
  r12 = load(r11)
  if (r12 != 0)
  r3 = load(r2)
  r10 = r3 / r6
  r11 = r8
  store (r11, r7)
  store (r12, r3)