Control Flow IV: Control Flow Optimizations, Start Dataflow Analysis


Control Flow IV: Control Flow Optimizations, Start Dataflow Analysis
EECS 483 – Lecture 22
University of Michigan
Wednesday, November 22, 2006

Announcements and Reading
Project 3
  Get started building openimpact! Make sure you get the environment variables set up before trying to install.
Reading: 10.6

l_code.h: Core Lcode Data Structures
L_Operand – src/dest operands
  Register, Macro, Literal (Int, Flt, Dbl, Label, String, Cb)
  Ignore ptype; ctype = data type
L_Oper – operations/instructions
  Has id, opcode, src/dest operands
  Ignore the pred operand
  Connected in a doubly linked list (next_op, prev_op)
L_Cb – blocks
  More general than a basic block – single entry, one or more exits
  First/last ops, weight, src_flow, dest_flow
  Connected in a doubly linked list (next_cb, prev_cb)

l_code.h: Core Lcode Data Structures (cont)
L_Flow – control flow edges for the CFG
  src_cb, dst_cb, weight
  Connected in a doubly linked list (prev/next)
L_Func – function
  First/last cbs, weight
  Oper and cb hash tables (id -> L_Oper/L_Cb)
L_Attr – extra information
  L_Oper, L_Cb, and L_Func have attributes
  Name, array of L_Operands

Example Opti: Local Constant Propagation

for (opA = cb->first_op; opA != NULL; opA = opA->next_op) {
  /* Match pattern */
  if (!L_move_opcode (opA))
    continue;
  if (!L_is_constant (opA->src[0]))
    continue;
  for (opB = opA->next_op; opB != NULL; opB = opB->next_op) {
    if (!L_is_src_operand (opA->dest[0], opB))
      continue;
    if (!L_can_change_src_operand (opB, opA->dest[0]))
      continue;
    if (!L_no_defs_between (opA->dest[0], opA, opB))
      continue;
    macro_flag = L_is_fragile_macro (opA->dest[0]);
    load_flag = 0;
    store_flag = 0;
    if (!L_no_danger (macro_flag, load_flag, store_flag, opA, opB))
      break;
    /* Replace pattern */
    for (i = 0; i < L_max_src_operand; i++) {
      if (L_same_operand (opA->dest[0], opB->src[i])) {
        opB->src[i] = L_copy_operand (opA->src[0]);
      }
    }
  }
}

In reality there is more code than this, but this is the important part.

Running Things Manually, Debugging
First, generate .lc files using the script
  compile_bench wc -c2lc
  tar zxvf wc.lc.tgz
  These are the unoptimized Lcode files
Run Lopti manually
  Lopti -Farch=IMPACT -Fmodel=Lcode -Fopti=1 -i wc.lc -o wc.O
  opti=0 will disable all optimizations
  Compare wc.lc and wc.O; you should see the results of your optimizations
  Note: if the benchmark has multiple files, you need to run Lopti on each .lc file

Running Things Manually, Debugging (cont)
Run Lemulate manually
  ls *.O > list
  Lemulate -i list
  Generates a .c file for each .O file
Compile the .c files with gcc
  gcc -g *.c $IMPACT_REL_PATH/platform/x86lin_gcc/impact_lemul_lib.o -lm
  Generates a.out, so now run it
Debugging – gdb
  Can gdb the benchmark a.out or Lopti
  Can also add print statements to Lopti

From Last Time: Loop Unroll Summary
Goals
  Reduce the number of executed branches inside the loop (Type 1/Type 2 only)
  Enable the overlapped execution of multiple iterations
    Reorder instructions between iterations
    Expose optimizations across iterations
Type 1 – compile-time constant trip count
Type 2 – counted, but trip count unknown at compile time
Type 3 – non-counted

From Last Time: Class Problem
Unroll both the outer loop and the inner loop 2x. Apply the most aggressive style of unrolling that you can, i.e., Type 1 if possible, else Type 2, else Type 3.

Original code:

  for (i = 0; i < 100; i++) {
    j = i;
    while (j < 100) {
      A[j++]--;
    }
    B[i] = 0;
  }

Outer loop unrolled 2x:

  for (i = 0; i < 100; i += 2) {
    j = i;
    while (j < 100) {
      A[j++]--;
    }
    B[i] = 0;
    j = i + 1;
    while (j < 100) {
      A[j++]--;
    }
    B[i+1] = 0;
  }

Outer → Type 1 (known trip count); Inner → Type 2 (counted loop)

From Last Time: Class Problem (cont'd)
Expanding the while loop – for this problem it needs to be done for both while loops.

Original inner loop:

  j = i;
  while (j < 100) {
    A[j++]--;
  }

Unrolled 2x (Type 2):

  j = i;
  tripcount = 100 - j;
  tripcount = tripcount / 1;
  remainder = tripcount % 2;
  final = remainder * 1;
  final = final + j;   /* I forgot this on the unroll 2 slide */

  Remloop:
  while (j < final) {
    A[j++]--;
  }
  Unrolledloop:
  while (j < 100) {
    A[j]--;
    A[j+1]--;
    j += 2;
  }

Loop Peeling
Unravel the first P iterations of a loop
  Enables overlap of instructions from the peeled iterations with preheader instructions
  Increases the potential for ILP
  Enables further optimization of the main body

Original loop:

  Preheader:
  Loop:
    r1 = MEM[r2 + 0]
    r4 = r1 * r5
    r6 = r4 << 2
    MEM[r3 + 0] = r6
    r2 = r2 + 1
    blt r2 100 Loop

After peeling (iteration 1 is copied ahead of the loop and guarded):

  Preheader:
    r1 = MEM[r2 + 0]      ; iteration 1
    r4 = r1 * r5
    r6 = r4 << 2
    MEM[r3 + 0] = r6
    r2 = r2 + 1
    bge r2 100 Done       ; more iters?
  Loop:
    ... (loop body as above)
    blt r2 100 Loop
  Done:
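To make the transformation concrete at the source level, here is a minimal C sketch, not from the slides, of peeling the first iteration of a simple counted loop; the array names and the bound N are made up for the example.

  /* Loop peeling sketch (illustrative only): peel the first iteration so it
   * can overlap with preheader code and so the main body can be optimized
   * knowing it starts at i = 1. */
  #include <stdio.h>

  #define N 100

  int main(void) {
      int a[N], b[N];
      for (int i = 0; i < N; i++) { a[i] = i; b[i] = 0; }

      /* Original loop:
       *   for (int i = 0; i < N; i++)
       *       b[i] = a[i] * 5;
       */

      /* After peeling the first iteration (guard in case the loop runs zero times): */
      if (N > 0) {
          b[0] = a[0] * 5;                 /* peeled iteration 1 */
          for (int i = 1; i < N; i++)      /* main body now starts at i = 1 */
              b[i] = a[i] * 5;
      }

      printf("%d %d\n", b[0], b[N - 1]);
      return 0;
  }

The peeled copy can be scheduled together with whatever preheader code precedes the loop, while the main body no longer has to handle the first iteration.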

Control Flow Opti for Acyclic Code
Rather simple transformations with these goals:
  Reduce the number of dynamic branches
  Make larger basic blocks
  Reduce code size
Classic control flow optimizations
  Branch to unconditional branch
  Unconditional branch to branch
  Branch to next basic block
  Basic block merging
  Branch to same target
  Branch target expansion
  Unreachable code elimination

Acyclic Control Flow Optimizations (1)

1. Branch to unconditional branch

   Before:
   L1: if (a < b) goto L2
       . . .
   L2: goto L3

   After:
   L1: if (a < b) goto L3
       . . .
   L2: goto L3               ← may be deleted

2. Unconditional branch to branch

   Before:
   L1: goto L2
       . . .
   L2: if (a < b) goto L3
   L4:

   After:
   L1: if (a < b) goto L3
       goto L4
       . . .
   L2: if (a < b) goto L3    ← may be deleted
   L4:

Acyclic Control Flow Optimizations (2)

3. Branch to next basic block – the branch is unnecessary when its target is the fall-through block

   Before:
   L1: . . .
       if (a < b) goto L2    ; end of BB1
   L2: . . .                 ; BB2

   After:
   L1: . . .                 ; BB1
   L2: . . .                 ; BB2

4. Basic block merging – merge BBs when there is a single edge between them (BB1 has one successor, BB2 has one predecessor)

   Before:
   L1: . . .                 ; BB1
   L2: . . .                 ; BB2

   After:
   L1: . . .                 ; merged BB1/BB2
       . . .

Acyclic Control Flow Optimizations (3)

5. Branch to same target

   Before:
   L1: if (a < b) goto L2
       goto L2
       . . .

   After:
   L1: goto L2
       . . .

6. Branch target expansion

   Before:
       stuff1
   L1: goto L2               ; BB1
       . . .
   L2: stuff2                ; BB2
       . . .

   After:
       stuff1
   L1: stuff2                ; the goto is replaced by a copy of BB2's code
       . . .
   L2: stuff2                ; BB2
       . . .

What about expanding a conditional branch? Almost the same.
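As an illustration of how a cleanup pass might implement optimization 1 (branch to unconditional branch), here is a minimal sketch, not from the slides, that retargets branches whose destination is a goto-only block. The toy CFG representation (a successor array plus an is_empty_goto flag) is invented for this example and is not the Lcode API.

  /* Sketch: collapse "branch to unconditional branch" chains on a toy CFG.
   * Each block is an index; target[b] is the block that b branches to, and
   * is_empty_goto[b] says whether b contains only an unconditional goto. */
  #include <stdio.h>

  #define NBLOCKS 6

  static int follow(const int target[], const int is_empty_goto[], int b) {
      /* Follow chains of goto-only blocks to the final destination.
       * Limit the walk to NBLOCKS steps so a goto cycle cannot loop forever. */
      for (int steps = 0; steps < NBLOCKS && b >= 0 && is_empty_goto[b]; steps++)
          b = target[b];
      return b;
  }

  int main(void) {
      /* Block 0 branches to 1, which only does "goto 2", which only does
       * "goto 4"; blocks 3..5 are ordinary; block 5 has no outgoing branch. */
      int target[NBLOCKS]        = { 1, 2, 4, 5, 5, -1 };
      int is_empty_goto[NBLOCKS] = { 0, 1, 1, 0, 0, 0 };

      for (int b = 0; b < NBLOCKS; b++) {
          if (target[b] < 0)
              continue;                     /* exit block: no outgoing branch */
          int t = follow(target, is_empty_goto, target[b]);
          if (t != target[b]) {
              printf("retarget block %d: %d -> %d\n", b, target[b], t);
              target[b] = t;                /* branch now goes straight to t */
          }
      }
      return 0;
  }

A real pass would also delete goto-only blocks that become unreachable afterwards, which is exactly the unreachable-code-elimination step on the next slide.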

Unreachable Code Elimination
Algorithm:

  mark the procedure entry BB visited
  to_visit = { procedure entry BB }
  while (to_visit not empty) {
      current = to_visit.pop()
      for (each successor block of current) {
          if (successor not visited) {
              mark successor visited
              to_visit += successor
          }
      }
  }
  eliminate all unvisited blocks

(Figure: example CFG with entry and blocks bb1–bb5.) Which BB(s) can be deleted?
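A minimal C sketch of the marking pass above, run on a made-up six-block CFG in which bb5 has no incoming edges (the adjacency lists are invented for this example, not taken from the slide's figure):

  /* Worklist marking for unreachable-code elimination on a toy CFG. */
  #include <stdio.h>

  #define NBLOCKS 6   /* entry = 0, bb1..bb5 = 1..5 */

  int main(void) {
      /* succ[b] lists the successors of block b, terminated by -1. */
      static const int succ[NBLOCKS][3] = {
          {  1, -1, -1 },     /* entry -> bb1                             */
          {  2,  3, -1 },     /* bb1   -> bb2, bb3                        */
          {  4, -1, -1 },     /* bb2   -> bb4                             */
          {  4, -1, -1 },     /* bb3   -> bb4                             */
          { -1, -1, -1 },     /* bb4   -> (exit)                          */
          {  4, -1, -1 },     /* bb5   -> bb4, but nothing branches to it */
      };

      int visited[NBLOCKS] = { 0 };
      int worklist[NBLOCKS], top = 0;

      visited[0] = 1;                 /* mark the procedure entry block */
      worklist[top++] = 0;

      while (top > 0) {
          int current = worklist[--top];
          for (int i = 0; i < 3 && succ[current][i] >= 0; i++) {
              int s = succ[current][i];
              if (!visited[s]) {      /* enqueue each block at most once */
                  visited[s] = 1;
                  worklist[top++] = s;
              }
          }
      }

      for (int b = 1; b < NBLOCKS; b++)
          if (!visited[b])
              printf("bb%d is unreachable and can be deleted\n", b);
      return 0;
  }

Because each block is enqueued at most once, the fixed-size worklist never overflows and the pass runs in time proportional to the number of CFG edges.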

Class Problem
Maximally optimize the control flow of this code:

  L1:  if (a < b) goto L11
  L2:  goto L7
  L3:  goto L4
  L4:  stuff4
  L5:  if (c < d) goto L15
  L6:  goto L2
  L7:  if (c < d) goto L13
  L8:  goto L12
  L9:  stuff9
  L10: if (a < c) goto L3
  L11: goto L9
  L12: goto L2
  L13: stuff13
  L14: if (e < f) goto L11
  L15: stuff15
  L16: rts

Profile-based Control Flow Optimization: Trace Selection
Trace – linear collection of basic blocks that tend to execute in sequence ("likely control flow path")
  Acyclic (outer backedge OK)
  Side entrance – branch into the middle of a trace
  Side exit – branch out of the middle of a trace
(Figure: example CFG with blocks BB1–BB6 annotated with profiled block and edge weights.)
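The slides do not spell out the selection algorithm, but the usual approach is greedy: seed a trace with the hottest block not yet in any trace, then grow it forward along the most frequent successor edge. Here is a minimal C sketch of that idea on a toy profiled CFG; the data structures, weights, probabilities, and the 0.6 threshold are all invented for the example.

  /* Greedy trace-selection sketch on a toy profiled CFG. */
  #include <stdio.h>

  #define NBLOCKS 6
  #define MAXSUCC 2

  typedef struct {
      int succ[MAXSUCC];      /* successor block ids, -1 if unused */
      double prob[MAXSUCC];   /* edge probabilities from a profile */
      double weight;          /* block execution count             */
  } Block;

  int main(void) {
      /* A small diamond loosely inspired by the slide's example; all numbers invented. */
      Block cfg[NBLOCKS] = {
          { { 1,  2 }, { 0.8, 0.2 }, 100.0 },   /* BB1 */
          { { 3, -1 }, { 1.0, 0.0 },  80.0 },   /* BB2 */
          { { 3, -1 }, { 1.0, 0.0 },  20.0 },   /* BB3 */
          { { 4, -1 }, { 1.0, 0.0 }, 100.0 },   /* BB4 */
          { { 5, -1 }, { 1.0, 0.0 },  90.0 },   /* BB5 */
          { {-1, -1 }, { 0.0, 0.0 },  90.0 },   /* BB6 */
      };
      int trace_of[NBLOCKS];
      for (int b = 0; b < NBLOCKS; b++) trace_of[b] = -1;

      int ntraces = 0;
      for (;;) {
          /* Seed: hottest block not yet assigned to a trace. */
          int seed = -1;
          for (int b = 0; b < NBLOCKS; b++)
              if (trace_of[b] < 0 && (seed < 0 || cfg[b].weight > cfg[seed].weight))
                  seed = b;
          if (seed < 0) break;            /* every block is in some trace */

          int t = ntraces++;
          printf("trace %d:", t);
          for (int b = seed; b >= 0; ) {
              trace_of[b] = t;
              printf(" BB%d", b + 1);
              /* Grow along the most probable successor, if it is free and likely enough. */
              int best = -1;
              for (int i = 0; i < MAXSUCC; i++) {
                  int s = cfg[b].succ[i];
                  if (s >= 0 && trace_of[s] < 0 && cfg[b].prob[i] >= 0.6 &&
                      (best < 0 || cfg[b].prob[i] > cfg[b].prob[best]))
                      best = i;
              }
              b = (best >= 0) ? cfg[b].succ[best] : -1;
          }
          printf("\n");
      }
      return 0;
  }

Production trace selectors also grow traces backward through predecessors and stop at loop back edges; this sketch omits both for brevity.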

Linearizing a Trace
(Figure: the example CFG laid out as linear traces, with profile counts annotating the trace entry and exit counts, side entrances, and side exits.)

Intelligent Trace Layout for Icache Performance
  Intraprocedural code placement
  Procedure positioning
  Procedure splitting
(Figure: procedure view vs. trace view of the example – trace 1 = BB1, BB2; trace 2 = BB4, BB6; trace 3 = BB3, BB5; "the rest" is placed last.)

Dataflow Analysis + Optimization
Control flow analysis
  Treat each BB as a black box
  Just care about branches
Now ...
  Start looking at the operations inside BBs
  What's computed and where
Classical optimizations
  Make the computation more efficient
  Get rid of redundancy
  Simplify
Example: Common Subexpression Elimination

  r1 = r2 + r3
  r6 = r4 - r5

  r4 = 4
  r6 = 8

  r6 = r2 + r3
  r7 = r4 - r5

Is r2 + r3 redundant? What about r4 - r5? What if there were 1000 BBs? Dataflow analysis!!

Dataflow Analysis Introduction
Dataflow analysis – collection of information that summarizes the creation/destruction of values in a program. Used to identify legal optimization opportunities.

Pick an arbitrary point in the program (the figure reuses the previous slide's example code):
  Which VRs contain useful data values? (liveness, or upward exposed uses)
  Which definitions may reach this point? (reaching defns)
  Which definitions are guaranteed to reach this point? (available defns)
  Which uses below are exposed? (downward exposed uses)

Live Variable (Liveness) Analysis
Defn: for each point p in a program and each variable y, determine whether y can be used before being redefined, starting at p
Algorithm sketch
  For each BB, y is live if it is used before being defined in the BB, or if it is live leaving the block
  Backward dataflow analysis, as propagation occurs from uses upward to defs
Four sets
  USE = set of external variables consumed in the BB (used before being defined)
  DEF = set of variables defined in the BB
  IN  = set of variables that are live at the entry point of a BB
  OUT = set of variables that are live at the exit point of a BB

Liveness Example

  ; here r2, r3, r4, r5 are all live, as they are consumed later; r6 is dead, as it is redefined before being used
  r1 = r2 + r3
  r6 = r4 - r5
  ; here r4 is dead, as it is redefined before being used. So is r6. r2, r3, r5 are live
  r4 = 4
  r6 = 8

  r6 = r2 + r3
  r7 = r4 - r5

What does this mean? r6 = r4 - r5 is useless: it produces a dead value!! Get rid of it!
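The slide's conclusion can be mechanized: scan a block backward with a liveness set, and any assignment whose destination is not live at that point is dead code. A minimal C sketch follows; the bitmask representation is invented for the example, and it assumes (beyond what the slide shows) that r1 is live out of the block while r6 is not.

  /* Backward scan of one block with a liveness bit-vector; flag dead assignments. */
  #include <stdio.h>

  #define R(n) (1u << (n))

  typedef struct { int dest; int src[2]; } Op;   /* register numbers; -1 = unused */

  int main(void) {
      /* The first block of the slide's example: r1 = r2 + r3 ; r6 = r4 - r5 */
      Op block[] = {
          { 1, { 2, 3 } },
          { 6, { 4, 5 } },
      };
      int nops = 2;

      /* Assumed live-out set for the block: r1, r2, r3, r5 (r1 is presumed used
       * somewhere below; r6 is not live because it is redefined before use). */
      unsigned live = R(1) | R(2) | R(3) | R(5);

      for (int i = nops - 1; i >= 0; i--) {
          if (!(live & R(block[i].dest)))
              printf("op %d (defines r%d) produces a dead value -> delete it\n",
                     i, block[i].dest);
          /* Backward liveness update: kill the destination, then add the sources. */
          live &= ~R(block[i].dest);
          for (int j = 0; j < 2; j++)
              if (block[i].src[j] >= 0)
                  live |= R(block[i].src[j]);
      }
      printf("live on entry to the block: mask = 0x%x\n", live);
      return 0;
  }

Running it flags r6 = r4 - r5 as deletable and reports a live-on-entry mask containing r2, r3, r4, r5, matching the slide's annotation.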

Compute USE/DEF Sets For Each BB

for each basic block in the procedure, X, do
    DEF(X) = ∅
    USE(X) = ∅
    for each operation in sequential order in X, op, do
        for each source operand of op, src, do
            if (src not in DEF(X)) then
                USE(X) += src
            endif
        endfor
        for each destination operand of op, dest, do
            DEF(X) += dest
        endfor
    endfor
endfor

DEF is the union of all the LHSs; USE is all the VRs that are used before being defined.
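A minimal C sketch of this USE/DEF pass over a toy block representation (bitmask register sets and a small op array, invented for this example; not the Lcode data structures). It reproduces the first block of the example on the next slide.

  /* Compute USE/DEF for one basic block, one bit per virtual register. */
  #include <stdio.h>

  #define R(n) (1u << (n))
  #define MAXSRC 2

  typedef struct {
      int dest;            /* destination register, -1 if none */
      int src[MAXSRC];     /* source registers, -1 if unused   */
  } Op;

  static void use_def(const Op *ops, int nops, unsigned *use, unsigned *def) {
      *use = 0;
      *def = 0;
      for (int i = 0; i < nops; i++) {
          /* Sources count as USE only if not already defined in this block. */
          for (int j = 0; j < MAXSRC; j++)
              if (ops[i].src[j] >= 0 && !(*def & R(ops[i].src[j])))
                  *use |= R(ops[i].src[j]);
          if (ops[i].dest >= 0)
              *def |= R(ops[i].dest);
      }
  }

  int main(void) {
      /* BB1 of the next slide's example:
       *   r1 = MEM[r2+0] ; r2 = r2 + 1 ; r3 = r1 * r4 */
      Op bb1[] = {
          { 1, { 2, -1 } },
          { 2, { 2, -1 } },
          { 3, { 1,  4 } },
      };
      unsigned use, def;
      use_def(bb1, 3, &use, &def);
      printf("USE = 0x%x (expect r2,r4), DEF = 0x%x (expect r1,r2,r3)\n", use, def);
      return 0;
  }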

Example USE/DEF Calculation

BB1:  r1 = MEM[r2+0]
      r2 = r2 + 1
      r3 = r1 * r4

BB2:  r1 = r1 + 5
      r3 = r5 - r1
      r7 = r3 * 2

BB3:  r2 = 0
      r7 = 23
      r1 = 4

BB4:  r8 = r7 + 5
      r1 = r3 - r8
      r3 = r1 * 2

Compute IN/OUT Sets For All BBs

initialize IN(X) to ∅ for all basic blocks X
change = 1
while (change) do
    change = 0
    for each basic block in the procedure, X, do
        old_IN = IN(X)
        OUT(X) = Union(IN(Y)) for all successors Y of X
        IN(X) = USE(X) + (OUT(X) - DEF(X))
        if (old_IN != IN(X)) then
            change = 1
        endif
    endfor
endwhile

IN = set of variables that are live when the BB is entered
OUT = set of variables that are live when the BB is exited
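Here is a minimal C sketch of this iterative computation, using the USE/DEF sets from the example on the next slide as bitmasks. The CFG edges are an assumption (a simple diamond: BB1 → BB2 and BB3, both to BB4), since the transcript does not preserve the slide's figure.

  /* Iterative liveness (IN/OUT) sketch; registers are bits of an unsigned mask. */
  #include <stdio.h>

  #define R(n) (1u << (n))
  #define NBB 4
  #define MAXSUCC 2

  int main(void) {
      /* USE/DEF from the example slide: */
      unsigned use[NBB] = {
          R(2) | R(4),          /* BB1: USE = r2, r4 */
          R(1) | R(5),          /* BB2: USE = r1, r5 */
          0,                    /* BB3: USE = empty  */
          R(3) | R(7),          /* BB4: USE = r3, r7 */
      };
      unsigned def[NBB] = {
          R(1) | R(2) | R(3),   /* BB1 */
          R(1) | R(3) | R(7),   /* BB2 */
          R(1) | R(2) | R(7),   /* BB3 */
          R(1) | R(3) | R(8),   /* BB4 */
      };
      int succ[NBB][MAXSUCC] = {
          {  1,  2 },           /* BB1 -> BB2, BB3 (assumed) */
          {  3, -1 },           /* BB2 -> BB4      (assumed) */
          {  3, -1 },           /* BB3 -> BB4      (assumed) */
          { -1, -1 },           /* BB4 -> exit               */
      };

      unsigned in[NBB] = { 0 }, out[NBB] = { 0 };
      int change = 1;
      while (change) {
          change = 0;
          for (int x = 0; x < NBB; x++) {
              unsigned old_in = in[x];
              out[x] = 0;
              for (int i = 0; i < MAXSUCC; i++)
                  if (succ[x][i] >= 0)
                      out[x] |= in[succ[x][i]];         /* OUT = union of successor INs */
              in[x] = use[x] | (out[x] & ~def[x]);      /* IN = USE + (OUT - DEF)       */
              if (in[x] != old_in)
                  change = 1;
          }
      }
      for (int x = 0; x < NBB; x++)
          printf("BB%d: IN = 0x%x  OUT = 0x%x\n", x + 1, in[x], out[x]);
      return 0;
  }

Processing blocks in reverse order would make this backward problem converge in fewer passes, but the fixed point it reaches is the same.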

Example IN/OUT Calculation

BB1:  r1 = MEM[r2+0]
      r2 = r2 + 1
      r3 = r1 * r4
      USE = r2,r4    DEF = r1,r2,r3

BB2:  r1 = r1 + 5
      r3 = r5 - r1
      r7 = r3 * 2
      USE = r1,r5    DEF = r1,r3,r7

BB3:  r2 = 0
      r7 = 23
      r1 = 4
      USE = ∅        DEF = r1,r2,r7

BB4:  r8 = r7 + 5
      r1 = r3 - r8
      r3 = r1 * 2
      USE = r3,r7    DEF = r1,r3,r8

Class Problem
Compute liveness, i.e., calculate USE/DEF and then calculate IN/OUT.
(Figure: a small CFG containing the following operations; block boundaries and edges are shown in the original slide.)

  r2 = r3
  r3 = r4
  r1 = r1 + 1
  r7 = r1 * r2
  r2 = 0
  r2 = r2 + 1
  r4 = r2 + r1
  r9 = r4 + r8