EECS 583 – Class 8 Classic Optimization

Slides:

Advertisements

Similar presentations

CSC 4181 Compiler Construction Code Generation & Optimization.

Advertisements

Data-Flow Analysis II CS 671 March 13, CS 671 – Spring Data-Flow Analysis Gather conservative, approximate information about what a program.

7. Optimization Prof. O. Nierstrasz Lecture notes by Marcus Denker.

Course Outline Traditional Static Program Analysis Software Testing

Lecture 11: Code Optimization CS 540 George Mason University.

Chapter 9 Code optimization Section 0 overview 1.Position of code optimizer 2.Purpose of code optimizer to get better efficiency –Run faster –Take less.

1 CS 201 Compiler Construction Lecture 3 Data Flow Analysis.

CMPUT Compiler Design and Optimization1 CMPUT680 - Winter 2006 Topic 5: Peep Hole Optimization José Nelson Amaral

19 Classic Examples of Local and Global Code Optimizations Local Constant folding Constant combining Strength reduction.

1 Chapter 8: Code Generation. 2 Generating Instructions from Three-address Code Example: D = (A*B)+C =* A B T1 =+ T1 C T2 = T2 D.

Course Outline Traditional Static Program Analysis –Theory Compiler Optimizations; Control Flow Graphs Data-flow Analysis – today’s class –Classic analyses.

Chapter 10 Code Optimization. A main goal is to achieve a better performance Front End Code Gen Intermediate Code source Code target Code user Machine-

C Chuen-Liang Chen, NTUCS&IE / 321 OPTIMIZATION Chuen-Liang Chen Department of Computer Science and Information Engineering National Taiwan University.

1 Code Optimization Code produced by compilation algorithms can often be improved (ideally optimized) in terms of run-time speed and the amount of memory.

Components of representation Control dependencies: sequencing of operations –evaluation of if & then –side-effects of statements occur in right order Data.

Program Representations. Representing programs Goals.

1 CS 201 Compiler Construction Lecture 7 Code Optimizations: Partial Redundancy Elimination.

1 Copy Propagation What does it mean? Given an assignment x = y, replace later uses of x with uses of y, provided there are no intervening assignments.

1 CS 201 Compiler Construction Lecture 5 Code Optimizations: Copy Propagation & Elimination.

Recap from last time We were trying to do Common Subexpression Elimination Compute expressions that are available at each program point.

Representing programs Goals. Representing programs Primary goals –analysis is easy and effective just a few cases to handle directly link related things.

Improving code generation. Better code generation requires greater context Over expressions: optimal ordering of subtrees Over basic blocks: Common subexpression.

Global optimization. Data flow analysis To generate better code, need to examine definitions and uses of variables beyond basic blocks. With use- definition.

9. Optimization Marcus Denker. 2 © Marcus Denker Optimization Roadmap  Introduction  Optimizations in the Back-end  The Optimizer  SSA Optimizations.

Class canceled next Tuesday. Recap: Components of IR Control dependencies: sequencing of operations –evaluation of if & then –side-effects of statements.

U NIVERSITY OF M ASSACHUSETTS, A MHERST D EPARTMENT OF C OMPUTER S CIENCE Emery Berger University of Massachusetts, Amherst Advanced Compilers CMPSCI 710.

1 Copy Propagation What does it mean? – Given an assignment x = y, replace later uses of x with uses of y, provided there are no intervening assignments.

Common Sub-expression Elim Want to compute when an expression is available in a var Domain:

Improving Code Generation Honors Compilers April 16 th 2002.

Recap from last time: live variables x := 5 y := x + 2 x := x + 1 y := x y...

Machine-Independent Optimizations Ⅰ CS308 Compiler Theory1.

Global optimization. Data flow analysis To generate better code, need to examine definitions and uses of variables beyond basic blocks. With use- definition.

Topic #10: Optimization EE 456 – Compiling Techniques Prof. Carl Sable Fall 2003.

Dataflow IV: Loop Optimizations, Exam 2 Review EECS 483 – Lecture 26 University of Michigan Wednesday, December 6, 2006.

U NIVERSITY OF M ASSACHUSETTS, A MHERST D EPARTMENT OF C OMPUTER S CIENCE Emery Berger University of Massachusetts, Amherst Advanced Compilers CMPSCI 710.

What’s in an optimizing compiler?

1 Code optimization “Code optimization refers to the techniques used by the compiler to improve the execution efficiency of the generated object code”

Cleaning up the CFG Eliminating useless nodes & edges C OMP 512 Rice University Houston, Texas Fall 2003 Copyright 2003, Keith D. Cooper & Linda Torczon,

Dataflow II: Finish Dataflow Analysis, Start on Classical Optimizations EECS 483 – Lecture 24 University of Michigan Wednesday, November 29, 2006.

EECS 583 – Class 9 Classic and ILP Optimization University of Michigan October 5, 2011.

Program Representations. Representing programs Goals.

EECS 583 – Class 8 Classic Optimization University of Michigan October 3, 2011.

CS 412/413 Spring 2005Introduction to Compilers1 CS412/CS413 Introduction to Compilers Tim Teitelbaum Lecture 30: Loop Optimizations and Pointer Analysis.

Code Optimization Code produced by compilation algorithms can often be improved (ideally optimized) in terms of run-time speed and the amount of memory.

Code Optimization Overview and Examples

High-level optimization Jakub Yaghob

Code Optimization.

Static Single Assignment

Program Representations

Optimization Code Optimization ©SoftMoore Consulting.

Basic Block Optimizations

Princeton University Spring 2016

Princeton University Spring 2016

Fall Compiler Principles Lecture 8: Loop Optimizations

Topic 10: Dataflow Analysis

Factored Use-Def Chains and Static Single Assignment Forms

Code Generation Part III

Code Optimization Overview and Examples Control Flow Graph

Code Generation Part III

EECS 583 – Class 6 More Dataflow Analysis

Optimizations using SSA

EECS 583 – Class 7 Static Single Assignment Form

EECS 583 – Class 8 Finish SSA, Start on Classic Optimization

EECS 583 – Class 5 Dataflow Analysis

EECS 583 – Class 7 Static Single Assignment Form

EECS 583 – Class 9 Classic and ILP Optimization

CSE P 501 – Compilers SSA Hal Perkins Autumn /31/2019

Live Variables – Basic Block

Basic Block Optimizations

Code Optimization.

Presentation transcript:

EECS 583 – Class 8 Classic Optimization University of Michigan October 1, 2018

Announcements & Reading Material HW2 – Get busy on it ASAP! Today’s class Compilers: Principles, Techniques, and Tools, A. Aho, R. Sethi, and J. Ullman, Addison-Wesley, 1988, 9.9, 10.2, 10.3, 10.7 Edition 1; 8.5, 8.7, 9.1, 9.4, 9.5 Edition 2 Material for Wednesday “Compiler Code Transformations for Superscalar-Based High-Performance Systems,” S. Mahlke, W. Chen, J. Gyllenhaal, W. Hwu, P. Chang, and T. Kiyohara, Proceedings of Supercomputing '92, Nov. 1992, pp. 808-817 And if you want more on ILP optimizations: D. J. Kuck, The Structure of Computers and Computations. New York, NY: John Wiley and Sons, 1978. (optional!)

Class Problem From Last Time – Answer Rename the variables Dominator tree Dominance frontier BB0 BB DF 0 - 1 - 2 4 3 4, 5 4 5 5 1 a0 = b0 = c0 = BB0 a1 = Phi(a0,a5) b1 = Phi(b0,b5) c1 = Phi(c0,c5) BB1 BB2 BB3 BB4 BB5 BB1 BB2 a3 = Phi(a1,a2) b3 = Phi(b1,b2) c3= Phi(c2,c1) c2 = b1 + a1 BB3 b2 = a1 + 1 a2 = b1 * c1 BB4 b4 = c3 – a3 a4 = Phi(a2,a3) b5 = Phi(b2,b4) c4 = Phi(c1,c3) BB5 a5 = a4 – c4 c5 = b5 * c4

Code Optimization Make the code run faster on the target processor My (Scott’s) favorite topic !! Other objectives: Power, code size Classes of optimization 1. Classical (machine independent) Reducing operation count (redundancy elimination) Simplifying operations Generally good for any kind of machine 2. Machine specific Peephole optimizations Take advantage of specialized hardware features 3. Parallelism enhancing Increasing parallelism (ILP or TLP) Possibly increase instructions

A Tour Through the Classical Optimizations For this class – Go over concepts of a small subset of the optimizations What it is, why its useful When can it be applied (set of conditions that must be satisfied) How it works Give you the flavor but don’t want to beat you over the head Challenges Register pressure? Parallelism verses operation count

Dead Code Elimination Remove any operation who’s result is never consumed Rules X can be deleted no stores or branches DU chain empty or dest register not live This misses some dead code!! Especially in loops Better Algorithm Critical operation store or branch operation Any operation that does not directly or indirectly feed a critical operation is dead Trace UD chains backwards from critical operations Any op not visited is dead 1. r1 = 3 2. r2 = 10 BB1 3. r4 = r4 + 1 4. r7 = r1 * r4 BB2 BB3 BB4 5. r2 = 0 6. r3 = r3 + 1 BB5 7. r3 = r2 + r1 BB6 8. store (r1, r3)

Local Constant Propagation Forward propagation of moves of the form rx = L (where L is a literal) Maximally propagate Consider 2 ops, X and Y in a BB, X is before Y 1. X is a move 2. src1(X) is a literal 3. Y consumes dest(X) 4. There is no definition of dest(X) between X and Y BB1 1. r1 = 5 2. r2 = ‘_x’ 3. r3 = 7 4. r4 = r4 + r1 5. r1 = r1 + r2 6. r1 = r1 + 1 7. r3 = 12 8. r8 = r1 - r2 9. r9 = r3 + r5 10. r3 = r2 + 1 11. r10 = r3 – r1 Note, ignore operation format issues, so all operations can have literals in either operand position

Global Constant Propagation Consider 2 ops, X and Y in different BBs 1. X is a move 2. src1(X) is a literal 3. Y consumes dest(X) 4. X is in a_in(BB(Y)) 5. Dest(x) is not modified between the top of BB(Y) and Y BB1 1. r1 = 5 2. r2 = ‘_x’ BB2 BB3 3. r1 = r1 + r2 4. r7 = r1 – r2 BB4 5. r8 = r1 * r2 BB5 6. r9 = r1 + r2

Constant Folding Simplify 1 operation based on values of src operands Constant propagation creates opportunities for this All constant operands Evaluate the op, replace with a move r1 = 3 * 4  r1 = 12 r1 = 3 / 0  ??? Don’t evaluate excepting ops !, what about floating-point? Evaluate conditional branch, replace with BRU or noop if (1 < 2) goto BB2  BRU BB2 if (1 > 2) goto BB2  convert to a noop Algebraic identities r1 = r2 + 0, r2 – 0, r2 | 0, r2 ^ 0, r2 << 0, r2 >> 0 r1 = r2 r1 = 0 * r2, 0 / r2, 0 & r2 r1 = 0 r1 = r2 * 1, r2 / 1

Class Problem Optimize this applying 1. constant propagation 2. constant folding BB1 4. r4 = 1 5. r7 = r1 * 4 6. r6 = 8 7. if (r3 > 0) BB2 BB3 BB4 8. r2 = 0 9. r6 = r6 * r7 10. r3 = r2 / r6 11. r3 = r4 12. r3 = r3 + r2 13. r1 = r6 14. r2 = r2 + 1 15. r1 = r1 + 1 16. if (r1 < 100) BB5 BB6 17. store (r1, r3)

Forward Copy Propagation Forward propagation of the RHS of moves r1 = r2 … r4 = r1 + 1  r4 = r2 + 1 Benefits Reduce chain of dependences Eliminate the move Rules (ops X and Y) X is a move src1(X) is a register Y consumes dest(X) X.dest is an available def at Y X.src1 is an available expr at Y BB1 1. r1 = r2 2. r3 = r4 BB2 BB3 3. r2 = 0 4. r6 = r3 + 1 BB4 5. r5 = r2 + r3

CSE – Common Subexpression Elimination Eliminate recomputation of an expression by reusing the previous result r1 = r2 * r3  r100 = r1 … r4 = r2 * r3  r4 = r100 Benefits Reduce work Moves can get copy propagated Rules (ops X and Y) X and Y have the same opcode src(X) = src(Y), for all srcs expr(X) is available at Y if X is a load, then there is no store that may write to address(X) along any path between X and Y BB1 1. r1 = r2 * r6 2. r3 = r4 / r7 BB2 BB3 3. r2 = r2 + 1 4. r6 = r3 * 7 BB4 5. r5 = r2 * r6 6. r8 = r4 / r7 7. r9 = r3 * 7 if op is a load, call it redundant load elimination rather than CSE

Class Problem 1. r4 = r1 2. r6 = r15 3. r2 = r3 * r4 4. r8 = r2 + r5 Optimize this applying 1. dead code elimination 2. forward copy propagation 3. CSE BB1 1. r4 = r1 2. r6 = r15 3. r2 = r3 * r4 4. r8 = r2 + r5 5. r9 = r3 6. r7 = load(r2) 7. if (r2 > r8) BB2 BB3 8. r5 = r9 * r4 9. r11 = r2 10. r12 = load(r11) 11. if (r12 != 0) 12. r3 = load(r2) 13. r10 = r3 / r6 14. r11 = r8 15. store (r11, r7) BB4 16. store (r12, r3)

Loop Invariant Code Motion (LICM) Move operations whose source operands do not change within the loop to the loop preheader Execute them only 1x per invocation of the loop Be careful with memory operations! Be careful with ops not executed every iteration BB1 1. r1 = 3 2. r5 = &A BB2 3. r4 = load(r5) 4. r7 = r4 * 3 BB3 BB4 5. r8 = r2 + 1 6. r7 = r8 * r4 7. r3 = r2 + 1 BB5 8. r1 = r1 + r7 BB6 9. store (r1, r3)

LICM (2) Rules 1. r1 = 3 2. r5 = &A 3. r4 = load(r5) 4. r7 = r4 * 3 X can be moved src(X) not modified in loop body X is the only op to modify dest(X) for all uses of dest(X), X is in the available defs set for all exit BB, if dest(X) is live on the exit edge, X is in the available defs set on the edge if X not executed on every iteration, then X must provably not cause exceptions if X is a load or store, then there are no writes to address(X) in loop BB1 1. r1 = 3 2. r5 = &A BB2 3. r4 = load(r5) 4. r7 = r4 * 3 BB3 BB4 5. r8 = r2 + 1 6. r7 = r8 * r4 7. r3 = r2 + 1 BB5 8. r1 = r1 + r7 BB6 9. store (r1, r3)

Global Variable Migration Assign a global variable temporarily to a register for the duration of the loop Load in preheader Store at exit points Rules X is a load or store address(X) not modified in the loop if X not executed on every iteration, then X must provably not cause an exception All memory ops in loop whose address can equal address(X) must always have the same address as X BB1 BB2 1. r4 = load(r5) 2. r4 = r4 + 1 BB3 BB4 3. r8 = load(r5) 4. r7 = r8 * r4 5. store(r5, r4) BB5 6. store(r5,r7) BB6

Induction Variable Strength Reduction Create basic induction variables from derived induction variables Induction variable BIV (i++) 0,1,2,3,4,... DIV (j = i * 4) 0, 4, 8, 12, 16, ... DIV can be converted into a BIV that is incremented by 4 Issues Initial and increment vals Where to place increments BB1 BB2 1. r5 = r4 - 3 2. r4 = r4 + 1 BB3 BB4 3. r7 = r4 * r9 BB5 4. r6 = r4 << 2 BB6

Induction Variable Strength Reduction (2) Rules X is a *, <<, + or – operation src1(X) is a basic ind var src2(X) is invariant No other ops modify dest(X) dest(X) != src(X) for all srcs dest(X) is a register Transformation Insert the following into the preheader new_reg = RHS(X) If opcode(X) is not add/sub, insert to the bottom of the preheader new_inc = inc(src1(X)) opcode(X) src2(X) else new_inc = inc(src1(X)) Insert the following at each update of src1(X) new_reg += new_inc Change X  dest(X) = new_reg BB1 BB2 1. r5 = r4 - 3 2. r4 = r4 + 1 BB3 BB4 3. r7 = r4 * r9 BB5 4. r6 = r4 << 2 BB6

Class Problem Optimize this applying 1. r1 = 0 induction var str reduction BB1 1. r1 = 0 2. r2 = 0 BB2 3. r5 = r5 + 1 4. r11 = r5 * 2 5. r10 = r11 + 2 6. r12 = load (r10+0) 7. r9 = r1 << 1 8. r4 = r9 - 10 9. r3 = load(r4+4) 10. r3 = r3 + 1 11. store(r4+0, r3) 12. r7 = r3 << 2 13. r6 = load(r7+0) 14. r13 = r2 - 1 15. r1 = r1 + 1 16. r2 = r2 + 1 BB3 r13, r12, r6, r10 liveout