EECS 583 – Class 9 Classic and ILP Optimization University of Michigan October 5, 2011.


- 1 - Announcements & Reading Material
- Look on Phorum for help with HW #2
  » Daya's post on adding/deleting instructions
  » Splitting a BB
- Today's class
  » "Compiler Code Transformations for Superscalar-Based High-Performance Systems," Scott Mahlke, William Chen, John Gyllenhaal, Wen-mei Hwu, Pohua Chang, and Tokuzo Kiyohara, Proceedings of Supercomputing '92, Nov. 1992, pp.
- Next class (instruction scheduling)
  » "Machine Description Driven Compilers for EPIC Processors", B. Rau, V. Kathail, and S. Aditya, HP Technical Report, HPL (long paper but informative)

- 2 - HW#2 – Best Times from Last Semester
- Times in seconds for each application (no spec LICM vs. with spec LICM)
  » Perf
  » Perf
  » Perf
  » 583wc
- This translates to 89x, 178x, 125x, and 1.7x speedups
- Note: these were generated with an older LLVM, but they give you an idea of what to expect

- 3 - Course Project – Time to Start Thinking About This
- Mission statement: design and implement something "interesting" in a compiler
  » LLVM preferred, but others are fine
  » Groups of 2-3 people (1 or 4 persons is possible in some cases)
  » Extend an existing research paper or go out on your own
- Topic areas
  » Automatic parallelization/SIMDization
  » Optimizing for GPUs
  » Targeting VLIW processors
  » Memory system optimization
  » Reliability
  » Energy
  » For the adventurous:
    · Dynamic optimization
    · Streaming applications
    · Binary optimization

- 4 - Course Projects – Timetable
- Now
  » Start thinking about potential topics, identify teammates
- Oct 24-28: Project proposals
  » No class that week
  » Daya/I will meet with each group
  » Informal proposal discussed at the meeting
  » Written proposal (a paragraph or 2 plus some references) due Oct 31
- Last 2 wks of Nov: Project checkpoint
  » Class as usual
  » Daya/I will again meet with each group
  » Update on your progress, what's left to do
- Dec 13-16: Project demos

- 5 - Sample Project Ideas (in random order)
- Memory system
  » Cache profiler for LLVM IR – miss rates, stride determination
  » Data cache prefetching
  » Data layout for improved cache behavior
- Optimizing for GPUs
  » Dumb OpenCL/CUDA → smart OpenCL/CUDA – selection of threads/blocks and managing on-chip memory
  » Reducing uncoalesced memory accesses – measurement of uncoalesced accesses, code restructuring to reduce these
  » Matlab → CUDA/OpenCL
  » StreamIt to CUDA/OpenCL (single or multiple GPUs)
- Reliability
  » AVF profiling, reducing vulnerability to soft errors
  » Coarse-grain code duplication for soft error protection

- 6 - More Project Ideas
- Parallelization/SIMDization
  » CUDA/OpenCL/Matlab/OpenMP to x86+SSE or Arm+NEON
  » DOALL loop parallelization, dependence breaking transformations
  » Stream parallelism analysis tool – profiling tool to identify maximal program regions that exhibit stream (or DOALL) parallelism
  » Access-execute program decomposition – decompose a program into memory access/compute threads
- Dynamic optimization (Dynamo, Dalvik VM)
  » Run-time DOALL loop parallelization
  » Run-time program analysis for security (taint tracking)
  » Run-time classic optimization (if they don't have it already!)
- Profiling tools
  » Distributed control flow/energy profiling (profiling + aggregation)

- 7 - And More Project Ideas
- VLIW
  » Superblock formation in LLVM
  » Acyclic BB scheduler for LLVM
  » Modulo BB scheduler for LLVM
- Binary optimizer
  » Arm binary to LLVM IR, de-register allocation
  » X86 binary (clean) to LLVM IR
- Energy
  » Minimizing instruction bit flips
- Misc
  » Program distillation – create a subset program with equivalent memory/branch behavior
  » Code sharing analysis for mobile applications

- 8 - From Last Time: Class Problem Solution
Optimize this applying: 1. constant propagation, 2. constant folding, 3. dead code elimination

Original:
r1 = 0
r2 = 10
r3 = 0
r4 = 1
r7 = r1 * 4
r6 = 8
if (r3 > 0)
store (r1, r3)
r2 = 0
r6 = r6 * r7
r3 = r2 / r6
r3 = r4
r3 = r3 + r2
r1 = r6
r2 = r2 + 1
r1 = r1 + 1
if (r1 < 100)

Optimized:
r1 = 0
r2 = 10
r3 = 0
r7 = r1 * 4
if (r3 > 0)
store (r1, r3)
r2 = 0
r3 = 0
r3 = 1 + r2
r1 = 8
r2 = r2 + 1
r1 = r1 + 1
if (r1 < 100)

- 9 - From Last Time: Class Problem Solution
Optimize this applying: 1. dead code elimination, 2. forward copy propagation, 3. CSE

Original:
r4 = r1
r6 = r15
r2 = r3 * r4
r8 = r2 + r5
r9 = r3
r7 = load(r2)
if (r2 > r8)
r5 = r9 * r4
r11 = r2
r12 = load(r11)
if (r12 != 0)
r3 = load(r2)
r10 = r3 / r6
r11 = r8
store (r11, r7)
store (r12, r3)

Optimized:
r2 = r3 * r1
r8 = r2 + r5
r7 = load(r2)
if (r2 > r8)
if (r7 != 0)
r3 = r7
store (r8, r7)
store (r12, r3)

Loop Invariant Code Motion (LICM)
- Move operations whose source operands do not change within the loop to the loop preheader
  » Execute them only 1x per invocation of the loop
  » Be careful with memory operations!
  » Be careful with ops not executed every iteration

r1 = 3
r5 = &A
r4 = load(r5)
r7 = r4 * 3
r8 = r2 + 1
r7 = r8 * r4
r3 = r2 + 1
r1 = r1 + r7
store (r1, r3)

LICM (2)
- Rules: X can be moved if
  » src(X) not modified in loop body
  » X is the only op to modify dest(X)
  » for all uses of dest(X), X is in the available defs set
  » for all exit BBs, if dest(X) is live on the exit edge, X is in the available defs set on the edge
  » if X is not executed on every iteration, then X must provably not cause exceptions
  » if X is a load or store, then there are no writes to address(X) in the loop

r1 = 3
r5 = &A
r4 = load(r5)
r7 = r4 * 3
r8 = r2 + 1
r7 = r8 * r4
r3 = r2 + 1
r1 = r1 + r7
store (r1, r3)
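The effect of LICM is easy to see at the source level. A minimal Python sketch (the function names, loop body, and values are illustrative, not from the slides' pseudo-assembly):

```python
def sum_scaled_original(a, scale):
    # Invariant work (scale * 3) is recomputed on every iteration.
    total = 0
    for x in a:
        factor = scale * 3   # operands never change inside the loop
        total += x * factor
    return total

def sum_scaled_licm(a, scale):
    # After LICM: the invariant computation moves to the "preheader",
    # so it executes once per invocation of the loop.
    factor = scale * 3
    total = 0
    for x in a:
        total += x * factor
    return total
```

Both versions compute the same result; the transformed one simply does the multiply once instead of once per iteration.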

Global Variable Migration
- Assign a global variable temporarily to a register for the duration of the loop
  » Load in preheader
  » Store at exit points
- Rules
  » X is a load or store
  » address(X) not modified in the loop
  » if X is not executed on every iteration, then X must provably not cause an exception
  » all memory ops in the loop whose address can equal address(X) must always have the same address as X

r4 = load(r5)
r4 = r4 + 1
r8 = load(r5)
r7 = r8 * r4
store(r5, r4)
store(r5, r7)
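A Python sketch of the load-in-preheader / store-at-exit pattern, with a dict standing in for memory and a hypothetical global `g` (names are illustrative only):

```python
MEM = {"g": 0}   # stands in for a global variable living in memory

def accumulate_original(vals):
    # Every iteration performs a load and a store of the global.
    for v in vals:
        MEM["g"] = MEM["g"] + v

def accumulate_migrated(vals):
    r4 = MEM["g"]        # load once in the preheader
    for v in vals:
        r4 = r4 + v      # register-only updates inside the loop
    MEM["g"] = r4        # store once at the loop exit
```

Per the rules above, this is only legal because no other memory operation in the loop can alias the global's address.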

Induction Variable Strength Reduction
- Create basic induction variables from derived induction variables
- Induction variables
  » BIV (i++) → 0, 1, 2, 3, 4, ...
  » DIV (j = i * 4) → 0, 4, 8, 12, 16, ...
  » A DIV can be converted into a BIV that is incremented by 4
- Issues
  » Initial and increment values
  » Where to place increments

r5 = r4 - 3
r4 = r4 + 1
r7 = r4 * r9
r6 = r4 << 2

Induction Variable Strength Reduction (2)
- Rules
  » X is a *, <<, + or – operation
  » src1(X) is a basic induction variable
  » src2(X) is invariant
  » no other ops modify dest(X)
  » dest(X) != src(X) for all srcs
  » dest(X) is a register
- Transformation
  » Insert the following into the preheader:
    · new_reg = RHS(X)
  » If opcode(X) is not add/sub, insert to the bottom of the preheader:
    · new_inc = inc(src1(X)) opcode(X) src2(X)
  » else:
    · new_inc = inc(src1(X))
  » Insert the following at each update of src1(X):
    · new_reg += new_inc
  » Change X: dest(X) = new_reg

r5 = r4 - 3
r4 = r4 + 1
r7 = r4 * r9
r6 = r4 << 2
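The transformation steps above, applied to the classic `j = i * 4` derived induction variable, look like this in Python (a sketch with illustrative names; `new_reg` follows the slide's terminology):

```python
def addresses_original(n):
    # DIV: j = i * 4, recomputed with a multiply on every iteration.
    return [i * 4 for i in range(n)]

def addresses_reduced(n):
    out = []
    new_reg = 0 * 4          # preheader: new_reg = RHS(X) at the initial value
    for i in range(n):
        out.append(new_reg)  # every use of the DIV now reads new_reg
        new_reg += 4         # new_reg += new_inc at each update of src1(X)
    return out
```

The multiply per iteration becomes an add per iteration, which is the "strength" being reduced.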

Class Problem
Optimize this applying induction variable strength reduction

r1 = 0
r2 = 0

r5 = r5 + 1
r11 = r5 * 2
r10 = r
r12 = load (r10+0)
r9 = r1 << 1
r4 = r
r3 = load(r4+4)
r3 = r3 + 1
store(r4+0, r3)
r7 = r3 << 2
r6 = load(r7+0)
r13 = r2 - 1
r1 = r1 + 1
r2 = r2 + 1

liveout: r13, r12, r6, r10

ILP Optimization
- Traditional optimizations
  » Redundancy elimination
  » Reducing operation count
- ILP (instruction-level parallelism) optimizations
  » Increase the amount of parallelism and the ability to overlap operations
  » Operation count is secondary; we often trade extra instructions for parallelism (but avoid code explosion)
- ILP is increased by breaking dependences
  » True or flow dependence = read after write
  » False dependences (anti/output) = write after read, write after write

Register Renaming
- Similar goal to SSA construction, but simpler
- Remove dependences caused by variable re-use
  » Re-use of source variables
  » Re-use of temporaries
  » Anti, output dependences
- Create a new variable to hold each unique lifetime
- Very simple transformation with straight-line code
  » Make each def a unique register
  » Substitute the new name into subsequent uses

Before:
a: r1 = r2 + r3
b: r3 = r4 + r5
c: r1 = r7 * r8
d: r7 = r1 + r5
e: r1 = r3 + 4
f: r4 = r7 + 4

After renaming:
a: r1 = r2 + r3
b: r13 = r4 + r5
c: r11 = r7 * r8
d: r17 = r11 + r5
e: r21 = r13 + 4
f: r14 = r17 + 4
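The renamed code can be checked end-to-end: for any inputs, the final live values match. A Python sketch of the straight-line example (registers become ordinary variables; inputs are arbitrary):

```python
def straight_line_before(r2, r3, r4, r5, r7, r8):
    r1 = r2 + r3     # a
    r3 = r4 + r5     # b: overwrites r3 -> anti dependence with a
    r1 = r7 * r8     # c: overwrites r1 -> output dependence with a
    r7 = r1 + r5     # d
    r1 = r3 + 4      # e
    r4 = r7 + 4      # f
    return r1, r4    # final live values

def straight_line_renamed(r2, r3, r4, r5, r7, r8):
    r1 = r2 + r3     # a
    r13 = r4 + r5    # b: unique destination, anti dep gone
    r11 = r7 * r8    # c: unique destination, output dep gone
    r17 = r11 + r5   # d
    r21 = r13 + 4    # e
    r14 = r17 + 4    # f
    return r21, r14  # same values, new names
```

After renaming, b/c/e/f no longer conflict over r1, r3, r4, r7, so independent operations can be scheduled in parallel.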

Global Register Renaming
- The straight-line code strategy does not work
  » A single use may have multiple reaching defs
- Web = collection of defs/uses which have possible value flow between them
  » Identify webs:
    · Take a def, add all uses
    · Take all uses, add all reaching defs
    · Take all defs, add all uses
    · Repeat until a stable solution
  » Each web is renamed if its name is the same as another web's

[Figure: CFG example showing the defs and uses of x and y grouped into webs]

Back Substitution
- Generation of expressions by compiler frontends is very sequential
  » Account for operator precedence
  » Apply left-to-right within the same precedence
- Back substitution
  » Create larger expressions
    · Iteratively substitute the RHS expression for the LHS variable
  » Note – may correspond to multiple source statements
  » Enables subsequent optimizations
- Optimization
  » Re-compute the expression in a more favorable manner

Source: y = a + b + c – d + e – f;

r9 = r1 + r2
r10 = r9 + r3
r11 = r10 - r4
r12 = r11 + r5
r13 = r12 – r6

Subs r12: r13 = r11 + r5 – r6
Subs r11: r13 = r10 – r4 + r5 – r6
Subs r10: r13 = r9 + r3 – r4 + r5 – r6
Subs r9: r13 = r1 + r2 + r3 – r4 + r5 – r6
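The substitution chain is just algebra, which a quick Python check confirms (variable names mirror the slide's registers; the inputs are arbitrary):

```python
def chained(r1, r2, r3, r4, r5, r6):
    # Frontend output: one small operation per source-level step.
    r9 = r1 + r2
    r10 = r9 + r3
    r11 = r10 - r4
    r12 = r11 + r5
    r13 = r12 - r6
    return r13

def back_substituted(r1, r2, r3, r4, r5, r6):
    # After iteratively substituting each RHS into its use:
    # one large expression the optimizer can now re-associate.
    return r1 + r2 + r3 - r4 + r5 - r6
```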

Tree Height Reduction
- Re-compute the expression as a balanced binary tree
  » Obey precedence rules
  » Essentially re-parenthesize
  » Combine literals if possible
- Effects
  » Height reduced for n terms: n-1 → ceil(log2(n)) (assuming unit latency)
  » Number of operations remains constant
  » Cost: temporary registers are "live" longer
  » Watch out:
    · Always OK for integer arithmetic
    · Floating-point – may not be!!

original:
r9 = r1 + r2
r10 = r9 + r3
r11 = r10 - r4
r12 = r11 + r5
r13 = r12 – r6

r13 after back subs:
r13 = r1 + r2 + r3 – r4 + r5 – r6

final code (balanced tree: ((r1 + r2) + (r3 – r4)) + (r5 – r6)):
t1 = r1 + r2
t2 = r3 – r4
t3 = r5 – r6
t4 = t1 + t2
r13 = t4 + t3
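The balanced form from the slide, checked in Python: t1, t2, and t3 are independent and can execute in parallel, so the critical path is 3 operations instead of 5 (a sketch; inputs are arbitrary integers):

```python
def reduced_tree(r1, r2, r3, r4, r5, r6):
    # Level 1: three independent ops, all schedulable in the same cycle.
    t1 = r1 + r2
    t2 = r3 - r4
    t3 = r5 - r6
    # Level 2 and 3: combine partial results.
    t4 = t1 + t2
    return t4 + t3

def sequential(r1, r2, r3, r4, r5, r6):
    # The original 5-op dependence chain for comparison.
    return ((((r1 + r2) + r3) - r4) + r5) - r6
```

For integers re-association is always safe; for floating-point the two versions can round differently, which is why the slide flags it.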

Class Problem
Assume latencies: + = 1, * = 3

Operand arrival times: r1: 0, r2: 0, r3: 0, r4: 1, r5: 2, r6: 0

r10 = r1 * r2
r11 = r10 + r3
r12 = r11 + r4
r13 = r12 – r5
r14 = r13 + r6

1. Back substitute
2. Re-express in tree-height-reduced form
3. Account for latency and arrival times

Optimizing Unrolled Loops
Unroll = replicate the loop body n-1 times, hoping to enable overlap of operation execution from different iterations. Not possible here!

loop:
r1 = load(r2)
r3 = load(r4)
r5 = r1 * r3
r6 = r6 + r5
r2 = r2 + 4
r4 = r4 + 4
if (r4 < 400) goto loop

unroll 3 times:

loop:
iter1:
r1 = load(r2)
r3 = load(r4)
r5 = r1 * r3
r6 = r6 + r5
r2 = r2 + 4
r4 = r4 + 4
iter2:
r1 = load(r2)
r3 = load(r4)
r5 = r1 * r3
r6 = r6 + r5
r2 = r2 + 4
r4 = r4 + 4
iter3:
r1 = load(r2)
r3 = load(r4)
r5 = r1 * r3
r6 = r6 + r5
r2 = r2 + 4
r4 = r4 + 4
if (r4 < 400) goto loop

Register Renaming on Unrolled Loop
(Starting from the unrolled loop on the previous slide.)

After renaming:

loop:
iter1:
r1 = load(r2)
r3 = load(r4)
r5 = r1 * r3
r6 = r6 + r5
r2 = r2 + 4
r4 = r4 + 4
iter2:
r11 = load(r2)
r13 = load(r4)
r15 = r11 * r13
r6 = r6 + r15
r2 = r2 + 4
r4 = r4 + 4
iter3:
r21 = load(r2)
r23 = load(r4)
r25 = r21 * r23
r6 = r6 + r25
r2 = r2 + 4
r4 = r4 + 4
if (r4 < 400) goto loop

Register Renaming is Not Enough!
- Still not much overlap possible
- Problems
  » r2, r4, r6 sequentialize the iterations
  » Need to rename these
- 2 specialized renaming optimizations
  » Accumulator variable expansion (r6)
  » Induction variable expansion (r2, r4)

(Renamed unrolled loop as on the previous slide.)

Accumulator Variable Expansion
- Accumulator variable
  » x = x + y or x = x – y
  » where y is loop variant!!
- Create n-1 temporary accumulators
- Each iteration targets a different accumulator
- Sum up the accumulator variables at the end
- May not be safe for floating-point values

preheader: r16 = r26 = 0

loop:
iter1:
r1 = load(r2)
r3 = load(r4)
r5 = r1 * r3
r6 = r6 + r5
r2 = r2 + 4
r4 = r4 + 4
iter2:
r11 = load(r2)
r13 = load(r4)
r15 = r11 * r13
r16 = r16 + r15
r2 = r2 + 4
r4 = r4 + 4
iter3:
r21 = load(r2)
r23 = load(r4)
r25 = r21 * r23
r26 = r26 + r25
r2 = r2 + 4
r4 = r4 + 4
if (r4 < 400) goto loop

after loop: r6 = r6 + r16 + r26
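The loop above is a dot product, so accumulator expansion is easy to sketch in Python (function names are illustrative; the list length is assumed to be a multiple of 3, the unroll factor):

```python
def dot_rolled(a, b):
    r6 = 0
    for i in range(len(a)):
        r6 = r6 + a[i] * b[i]   # r6 serializes every iteration
    return r6

def dot_acc_expanded(a, b):
    # Unroll factor n = 3, so create n-1 = 2 temporary accumulators.
    r6 = r16 = r26 = 0          # preheader: temporaries start at 0
    for i in range(0, len(a), 3):
        r6 += a[i] * b[i]           # iter1
        r16 += a[i + 1] * b[i + 1]  # iter2: independent accumulator
        r26 += a[i + 2] * b[i + 2]  # iter3: independent accumulator
    return r6 + r16 + r26       # sum the accumulators at the end
```

For integers the results are identical; for floating-point the changed summation order may round differently, matching the caveat on the slide.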

Induction Variable Expansion
- Induction variable
  » x = x + y or x = x – y
  » where y is loop invariant!!
- Create n-1 additional induction variables
- Each iteration uses and modifies a different induction variable
- Initialize induction variables to init, init+step, init+2*step, etc.
- Step increased to n * original step
- Now iterations are completely independent!!

preheader:
r16 = r26 = 0
r12 = r2 + 4, r22 = r2 + 8
r14 = r4 + 4, r24 = r4 + 8

loop:
iter1:
r1 = load(r2)
r3 = load(r4)
r5 = r1 * r3
r6 = r6 + r5
r2 = r2 + 12
r4 = r4 + 12
iter2:
r11 = load(r12)
r13 = load(r14)
r15 = r11 * r13
r16 = r16 + r15
r12 = r12 + 12
r14 = r14 + 12
iter3:
r21 = load(r22)
r23 = load(r24)
r25 = r21 * r23
r26 = r26 + r25
r22 = r22 + 12
r24 = r24 + 12
if (r4 < 400) goto loop

after loop: r6 = r6 + r16 + r26
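A Python sketch of the init/step rule: each of the n induction variables starts at init, init+step, init+2*step and advances by n * original step, so the three iteration bodies never touch each other's state (names are illustrative; the list length is assumed to be a multiple of 3):

```python
def sum_iv_expanded(a):
    total = 0
    i0, i1, i2 = 0, 1, 2            # init, init+step, init+2*step
    while i0 < len(a):
        total += a[i0]              # iter1 uses only i0
        total += a[i1]              # iter2 uses only i1
        total += a[i2]              # iter3 uses only i2
        i0 += 3                     # step = n * original step
        i1 += 3
        i2 += 3
    return total
```

Combined with accumulator expansion for `total`, the three bodies would share no registers at all, which is what makes them fully overlappable.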

Better Induction Variable Expansion
- With base+displacement addressing, we often don't need additional induction variables
  » Just change the offsets in each iteration to reflect the step
  » Change the final increments to n * original step

preheader: r16 = r26 = 0

loop:
iter1:
r1 = load(r2)
r3 = load(r4)
r5 = r1 * r3
r6 = r6 + r5
iter2:
r11 = load(r2+4)
r13 = load(r4+4)
r15 = r11 * r13
r16 = r16 + r15
iter3:
r21 = load(r2+8)
r23 = load(r4+8)
r25 = r21 * r23
r26 = r26 + r25
r2 = r2 + 12
r4 = r4 + 12
if (r4 < 400) goto loop

after loop: r6 = r6 + r16 + r26

Homework Problem
Optimize the unrolled loop:
1. Renaming
2. Tree height reduction
3. Induction/accumulator expansion

loop:
r1 = load(r2)
r5 = r6 + 3
r6 = r5 + r1
r2 = r2 + 4
if (r2 < 400) goto loop

unrolled 3 times:

loop:
r1 = load(r2)
r5 = r6 + 3
r6 = r5 + r1
r2 = r2 + 4
r1 = load(r2)
r5 = r6 + 3
r6 = r5 + r1
r2 = r2 + 4
r1 = load(r2)
r5 = r6 + 3
r6 = r5 + r1
r2 = r2 + 4
if (r2 < 400) goto loop