High-Level Transformations for Embedded Computing

Organization of a hypothetical optimizing compiler

DEPENDENCE ANALYSIS A dependence is a relationship between two computations that places constraints on their execution order. Dependence analysis identifies these constraints, which are then used to determine whether a particular transformation can be applied without changing the semantics of the computation. There are two types of dependences: (i) control dependence and (ii) data dependence. A statement S2 has a Control Dependence on S1 when the outcome of S1 determines whether S2 is executed. Two statements have a Data Dependence if they cannot be executed simultaneously due to conflicting uses of the same variable.

Types of Data Dependences (1)
Flow dependence (also called true dependence): the first statement writes a variable that the second statement reads.
S1: a = c*10
S2: d = 2*a + c
Anti-dependence: the first statement reads a variable that the second statement writes.
S1: e = f*4 + g
S2: g = 2*h

Types of Data Dependences (2)
Output dependence: both statements write the same variable.
S1: a = b*c
S2: a = d+e
Input dependence: two accesses to the same memory location are both reads.
Dependence Graph: nodes represent statements and arcs represent dependences between computations.

Loop Dependence Analysis To compute dependence information for loops, the key problem is understanding the use of arrays; scalar variables are relatively easy to manage. To track array behavior, the compiler must analyze the subscript expressions in each array reference. To discover whether there is a dependence in the loop nest, it is sufficient to determine whether any of the iterations can write a value that is read or written by any of the other iterations.
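A minimal sketch of the subscript analysis described above, assuming hypothetical arrays a, b, c and bound n declared elsewhere: comparing the subscripts of the write a[i] and the read a[i-1] across iterations reveals a loop-carried flow dependence.

```c
/* Iteration i writes a[i], and iteration i+1 reads it as a[i-1], so the
 * loop carries a flow dependence of distance 1; the iterations cannot
 * simply be executed in parallel. */
for (int i = 1; i < n; i++)
    a[i] = a[i-1] + b[i];

/* In contrast, this loop has no loop-carried dependence: each iteration
 * touches only its own element, so the iterations are independent. */
for (int i = 0; i < n; i++)
    c[i] = c[i] * 2.0;
```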

TRANSFORMATIONS
Data-Flow Based Loop Transformations
Loop Reordering
Loop Restructuring
Loop Replacement Transformations
Memory Access Transformations
Partial Evaluation
Redundancy Elimination
Procedure Call Transformations

Data-Flow Based Loop Transformations (1) A number of classical loop optimizations are based on data-flow analysis, which tracks the flow of data through a program's variables. Loop-based Strength Reduction replaces an expression in a loop with one that is equivalent but uses a less expensive operator. It is most commonly applied to induction-variable expressions, as sketched below.
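A minimal sketch, using a hypothetical loop: the multiplication of the induction variable is replaced by an additive update of a temporary.

```c
/* Before: each iteration multiplies the induction variable. */
for (int i = 0; i < n; i++)
    a[i] = i * 4;

/* After loop-based strength reduction: the multiply becomes an add. */
int t = 0;
for (int i = 0; i < n; i++) {
    a[i] = t;
    t += 4;
}
```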

Data-Flow Based Loop Transformations (2) Loop-invariant Code Motion When a computation appears inside a loop but its result does not change between iterations, the compiler can move that computation outside the loop. It is most profitable for expensive operations, as sketched below.
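A minimal sketch, assuming x does not change inside the loop (sqrt from <math.h> stands in for an expensive operation):

```c
/* Before: sqrt(x) is recomputed every iteration even though x is fixed. */
for (int i = 0; i < n; i++)
    a[i] = b[i] + sqrt(x);

/* After loop-invariant code motion: the invariant computation is hoisted. */
double s = sqrt(x);
for (int i = 0; i < n; i++)
    a[i] = b[i] + s;
```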

Data-Flow Based Loop Transformations (3) Loop Unswitching is applied when a loop contains a conditional with a loop-invariant test condition. The loop is then replicated inside each branch of the conditional, saving the overhead of conditional branching inside the loop, reducing the code size of the loop body, and possibly enabling the parallelization of a branch of the conditional
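A minimal sketch of unswitching, assuming a hypothetical flag that does not change inside the loop:

```c
/* Before: the loop-invariant test on flag is evaluated every iteration. */
for (int i = 0; i < n; i++) {
    if (flag)
        a[i] = b[i] + c[i];
    else
        a[i] = b[i] - c[i];
}

/* After loop unswitching: the invariant test is evaluated once, and each
 * copy of the loop has a smaller, branch-free body. */
if (flag) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
} else {
    for (int i = 0; i < n; i++)
        a[i] = b[i] - c[i];
}
```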

Loop Reordering Transformations These transformations change the relative order of execution of the iterations of a loop nest or nests. They are used to expose parallelism and improve memory locality.

Loop Reordering Transformations (1) Loop Interchange can be used to:
enable vectorization by interchanging an inner, dependent loop with an outer, independent loop;
improve vectorization by moving the independent loop with the largest range into the innermost position;
improve parallel performance by moving an independent loop outwards in a loop nest to increase the granularity of each iteration and reduce the number of barrier synchronizations;
reduce stride, ideally to stride 1 (see the sketch after this list); and
increase the number of loop-invariant expressions in the inner loop.
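A minimal sketch of interchange for stride reduction, assuming a row-major C array: the original inner loop walks down a column (stride N), while the interchanged inner loop walks along a row (stride 1).

```c
enum { N = 64 };
double a[N][N];

/* Before: the inner loop strides through memory by N elements. */
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        a[i][j] = a[i][j] * 2.0;

/* After loop interchange: the inner loop accesses consecutive elements. */
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        a[i][j] = a[i][j] * 2.0;
```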

Loop Reordering Transformations (2) Loop Skewing reshapes the iteration space by adding the outer loop index, multiplied by a skewing factor f, to the bounds and subscripts of the inner loop. The computation performed is unchanged, but the dependence vectors are, which can make a subsequent Loop Interchange legal and expose parallel iterations in the inner loop.
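A minimal sketch of skewing with factor 1, using a hypothetical stencil loop; the skewed loop computes exactly the same elements, but with changed dependence vectors that allow a later interchange to expose a parallel inner loop.

```c
enum { N = 64, M = 64 };
double a[N][M];

/* Before: dependence vectors (1,0) and (0,1); neither loop can be run in
 * parallel, and interchange alone does not help. */
for (int i = 1; i < N; i++)
    for (int j = 1; j < M; j++)
        a[i][j] = a[i-1][j] + a[i][j-1];

/* After skewing the inner loop by the outer index (j2 = j + i): the
 * dependence vectors become (1,1) and (0,1), so interchanging the two
 * loops afterwards leaves the new inner loop free of carried dependences. */
for (int i = 1; i < N; i++)
    for (int j2 = i + 1; j2 < i + M; j2++)
        a[i][j2 - i] = a[i-1][j2 - i] + a[i][j2 - i - 1];
```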

Loop Reordering Transformations (3) Loop Reversal Reversal changes the direction in which the loop traverses its iteration range. It is often used in conjunction with other iteration space reordering transformations because it changes the dependence vectors

Loop Reordering Transformations (4) Strip Mining splits a single loop into a pair of loops: an outer loop that steps across the iteration space in fixed-size strips, and an inner loop that executes the iterations within one strip (for example, strips of 64 iterations matching a vector length). It is a method of adjusting the granularity of an operation, especially a parallelizable operation, as sketched below.
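A minimal sketch, assuming a hypothetical strip length of 64:

```c
enum { STRIP = 64 };   /* hypothetical strip length, e.g. the vector length */

/* Before strip mining. */
for (int i = 0; i < n; i++)
    a[i] = a[i] + b[i];

/* After strip mining: the outer loop walks strip by strip, and each inner
 * loop covers at most STRIP iterations (the last strip may be shorter). */
for (int is = 0; is < n; is += STRIP) {
    int upper = (is + STRIP < n) ? is + STRIP : n;
    for (int i = is; i < upper; i++)
        a[i] = a[i] + b[i];
}
```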

Loop Reordering Transformations (5) Tiling is the multi-dimensional generalization of strip-mining. Tiling (also called blocking) is primarily used to improve cache reuse (QC) by dividing an iteration space into tiles and transforming the loop nest to iterate over them
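A minimal sketch of tiling a doubly nested loop, assuming a hypothetical tile size chosen so that a tile of each array fits in the cache:

```c
enum { N = 1024, T = 32 };   /* T is a hypothetical tile size dividing N */
double a[N][N], b[N][N];

/* Before: for large N, b is traversed column-wise over the whole array for
 * every row of a, so its cache lines are evicted before they are reused. */
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        a[i][j] = b[j][i];

/* After tiling: the outer loops iterate over T x T tiles, and the two
 * element loops stay inside one tile, improving cache reuse (QC). */
for (int ii = 0; ii < N; ii += T)
    for (int jj = 0; jj < N; jj += T)
        for (int i = ii; i < ii + T; i++)
            for (int j = jj; j < jj + T; j++)
                a[i][j] = b[j][i];
```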

Loop Reordering Transformations (6) Loop Distribution (also called loop fission or loop splitting) breaks a single loop into many (see the sketch after this list). It is used to:
create perfect loop nests;
create sub-loops with fewer dependences;
improve instruction cache and instruction TLB locality due to shorter loop bodies;
reduce memory requirements by iterating over fewer arrays; and
increase register re-use by decreasing register pressure.
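A minimal sketch, assuming the two statements are independent (c1 and c2 are hypothetical scalars):

```c
/* Before: one loop body updates two unrelated arrays. */
for (int i = 0; i < n; i++) {
    a[i] = a[i] + c1;
    b[i] = b[i] * c2;
}

/* After loop distribution: each loop touches fewer arrays and has a
 * shorter body, and each can be optimized or parallelized separately. */
for (int i = 0; i < n; i++)
    a[i] = a[i] + c1;
for (int i = 0; i < n; i++)
    b[i] = b[i] * c2;
```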

Loop Reordering Transformations (7) Loop Fusion (loop merging) is the inverse transformation: it merges adjacent loops with identical iteration control into one (see the sketch after this list). It can improve performance by:
reducing loop overhead;
increasing instruction parallelism;
improving register, vector, data cache, TLB, or page locality; and
improving the load balance of parallel loops.
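A minimal sketch, assuming both loops read the same array a:

```c
/* Before: two adjacent loops traverse a[] separately. */
for (int i = 0; i < n; i++)
    b[i] = a[i] + 1.0;
for (int i = 0; i < n; i++)
    c[i] = a[i] * 2.0;

/* After loop fusion: a[i] is loaded once per iteration and reused, and
 * the loop overhead is paid only once. */
for (int i = 0; i < n; i++) {
    b[i] = a[i] + 1.0;
    c[i] = a[i] * 2.0;
}
```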

Loop Restructuring Transformations These transformations change the structure of the loop, but leave the computations performed by an iteration of the loop body, and their relative order, unchanged.

Loop Restructuring Transformations (1) Loop Unrolling replicates the body of a loop some number of times, called the unrolling factor u, and iterates by step u instead of step 1 (see the sketch after this list). It is a fundamental technique for generating the long instruction sequences required by VLIW machines. Unrolling can improve performance by:
reducing loop overhead;
increasing instruction parallelism; and
improving register, data cache, or TLB locality.
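A minimal sketch with unrolling factor u = 4, using a hypothetical loop:

```c
/* Before unrolling. */
for (int i = 0; i < n; i++)
    a[i] = a[i] + b[i];

/* After unrolling by u = 4: the loop-control overhead is paid once per
 * four iterations, and the four independent adds can be scheduled in
 * parallel. A cleanup loop handles the leftover iterations when n is not
 * a multiple of 4. */
int i;
for (i = 0; i + 3 < n; i += 4) {
    a[i]   = a[i]   + b[i];
    a[i+1] = a[i+1] + b[i+1];
    a[i+2] = a[i+2] + b[i+2];
    a[i+3] = a[i+3] + b[i+3];
}
for (; i < n; i++)
    a[i] = a[i] + b[i];
```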

Loop Restructuring Transformations (2) Software Pipelining is another way to improve instruction parallelism. The operations of a single loop iteration are broken into s stages, and one iteration of the transformed loop performs stage 1 from iteration i, stage 2 from iteration i-1, and so on. Prologue code before the loop fills the pipeline for the first s-1 iterations, and epilogue code after the loop drains it for the last s-1 iterations.
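A minimal source-level sketch with s = 2 stages (load, then compute/store), assuming hypothetical arrays a and b, a scalar c, and n >= 1; real software pipelining is performed by the instruction scheduler, but the overlapping structure is the same.

```c
/* Stage 1: load b[i].  Stage 2: compute and store a[i].
 * Each trip through the steady-state loop performs stage 2 for iteration i
 * and stage 1 for iteration i+1, overlapping the load of the next
 * iteration with the computation of the current one. */
double t = b[0];                 /* prologue: stage 1 of iteration 0 */
for (int i = 0; i < n - 1; i++) {
    double next = b[i + 1];      /* stage 1 of iteration i+1 */
    a[i] = t * c;                /* stage 2 of iteration i */
    t = next;
}
a[n - 1] = t * c;                /* epilogue: stage 2 of the last iteration */
```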

Loop unrolling vs. software pipelining The difference between the two: unrolling reduces per-iteration loop overhead and enlarges the loop body for the scheduler, while software pipelining overlaps the stages of consecutive iterations so that, apart from the prologue and epilogue, the loop runs with the pipeline full on every iteration.

Loop Restructuring Transformations (3) Loop Coalescing combines a loop nest into a single loop, with the original indices computed from the resulting single induction variable
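A minimal sketch of coalescing a two-dimensional nest (hypothetical sizes):

```c
enum { N = 100, M = 50 };
double a[N][M];

/* Before: a two-deep loop nest. */
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        a[i][j] = 0.0;

/* After coalescing: a single loop whose induction variable covers the whole
 * iteration space; the original indices are recovered with a division and a
 * modulo. */
for (int t = 0; t < N * M; t++) {
    int i = t / M;
    int j = t % M;
    a[i][j] = 0.0;
}
```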

Loop Restructuring Transformations (4) Loop Collapsing is a simpler, more efficient, but less general version of coalescing in which the number of dimensions of the array is actually reduced. Collapsing eliminates the overhead of multiple nested loops and multi-dimensional array indexing.

Loop Restructuring Transformations (5) With Loop Peeling, a small number of iterations is removed from the beginning or end of the loop and executed separately (see the sketch below). Peeling has two uses: removing dependences created by the first or last few loop iterations, thereby enabling parallelization; and matching the iteration control of adjacent loops to enable fusion.
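A minimal sketch of peeling the first iteration, assuming a hypothetical loop whose first iteration is a special case:

```c
/* Before: the first iteration must be special-cased inside the loop. */
for (int i = 0; i < n; i++) {
    if (i == 0)
        b[i] = a[i];
    else
        b[i] = a[i] + a[i - 1];
}

/* After peeling the first iteration: the loop body is uniform and the
 * branch disappears from the hot loop. */
if (n > 0)
    b[0] = a[0];
for (int i = 1; i < n; i++)
    b[i] = a[i] + a[i - 1];
```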

Loop Replacement Transformations Transformations that operate on whole loops and completely alter their structure. Reduction Recognition: A reduction is an operation that computes a scalar value from an array. Common reductions include computing either the sum or the maximum value of the elements in an array.
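A minimal sketch of a sum reduction the compiler can recognize; once recognized, it can be mapped to vector reduction instructions or split into partial sums computed in parallel (assuming floating-point reassociation is acceptable).

```c
/* A recognizable reduction: the loop collapses the array a[] into the
 * scalar sum using an associative operator. */
double sum = 0.0;
for (int i = 0; i < n; i++)
    sum += a[i];

/* One way a compiler may restructure it: four partial sums that can be
 * computed independently, then combined after the loop. */
double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
int i;
for (i = 0; i + 3 < n; i += 4) {
    s0 += a[i];
    s1 += a[i + 1];
    s2 += a[i + 2];
    s3 += a[i + 3];
}
for (; i < n; i++)
    s0 += a[i];
double sum2 = s0 + s1 + s2 + s3;
```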

Loop Replacement Transformations (2) Array Statement Scalarization When a loop is expressed in array notation, the compiler can either convert it into vector operations or scalarize it into one or more serial loops. However, the conversion is not completely straightforward because array notation requires that the operation be performed as if every value on the right-hand side and every sub-expression on the left-hand side were computed before any assignments are performed.
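A minimal sketch (the array statement is shown in a comment, since C has no array notation): a naive forward scalarization of a(2:n) = a(1:n-1) + b(2:n) would read values already overwritten in the same loop, so the compiler must either introduce a temporary or, as here, reverse the loop.

```c
/* Array statement (Fortran-style):  a(2:n) = a(1:n-1) + b(2:n)
 * Semantics: every a(i-1) on the right-hand side refers to the value held
 * before any element of a is assigned. */

/* Naive scalarization -- WRONG: iteration i reads a[i-1], which the
 * previous iteration has already overwritten. */
for (int i = 1; i < n; i++)
    a[i] = a[i - 1] + b[i];

/* Correct scalarization: reversing the loop (or using a temporary array)
 * preserves the "all reads happen before any writes" semantics. */
for (int i = n - 1; i >= 1; i--)
    a[i] = a[i - 1] + b[i];
```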

Memory Access Transformations The large speed gap between the CPU and DRAM makes memory behavior critical to performance. Factors affecting memory performance include:
Re-use, denoted by Q and QC, the ratio of uses of an item to the number of times it is loaded;
Parallelism: vector machines often divide memory into banks, allowing vector registers to be loaded in a parallel or pipelined fashion; and
Working-set size: if all the memory elements accessed inside a loop do not fit in the data cache, then items that will be accessed in later iterations may be flushed, decreasing QC.
Memory system performance can be improved using loop interchange, loop tiling, loop unrolling, loop fusion, and various optimizations that eliminate register saves at procedure calls.

Memory Access Transformations (1) Array Padding is a transformation whereby unused data locations are inserted between the columns of an array or between arrays. Padding is used to ameliorate a number of memory-system conflicts, in particular:
bank conflicts on vector machines with banked memory;
cache-set or TLB-set conflicts;
cache misses in which conflicting references evict lines loaded by earlier references, precluding re-use; and
false sharing of cache lines on shared-memory multiprocessors.

Memory Access Transformations (2) Scalar Expansion Loops often contain variables that are used as temporaries within the loop body. Such variables will create an anti-dependence from one iteration to the next, and will have no other loop-carried dependences. Allocating one temporary for each iteration removes the dependence and makes the loop a candidate for parallelization Scalar expansion can also increase instruction-level parallelism by removing dependences.
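A minimal sketch, assuming the temporary t is the only loop-carried conflict (the expanded temporary is a C99 variable-length array):

```c
/* Before: the single scalar t is reused every iteration, creating
 * loop-carried anti- and output dependences that prevent running the
 * iterations in parallel. */
double t;
for (int i = 0; i < n; i++) {
    t = a[i] + b[i];
    c[i] = t * t;
}

/* After scalar expansion: each iteration gets its own element of the
 * expanded temporary, so the iterations are independent. */
double tx[n];                    /* expanded temporary, one slot per iteration */
for (int i = 0; i < n; i++) {
    tx[i] = a[i] + b[i];
    c[i] = tx[i] * tx[i];
}
```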

Partial Evaluation Partial evaluation refers to the general technique of performing part of a computation at compile time. Constant propagation is one of the most important optimizations that a compiler can perform and a good optimizing compiler will apply it aggressively. Programs typically contain many constants; by propagating them through the program, the compiler can do a significant amount of pre-computation. The propagation reveals many opportunities for other optimizations. Constant folding is a companion to constant propagation: when an expression contains an operation with constant values as operands, the compiler can replace the expression with the result.
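A minimal sketch of constant propagation followed by constant folding (hypothetical values):

```c
/* Before: n and scale are constants assigned once and never changed. */
int n = 64;
double scale = 2.0;
double x = scale * 3.0;
for (int i = 0; i < n; i++)
    a[i] = x;

/* After constant propagation and folding: the constants are substituted
 * for the variables and the constant expression is evaluated at compile
 * time, which also makes the trip count known to later optimizations. */
for (int i = 0; i < 64; i++)
    a[i] = 6.0;
```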

Partial Evaluation (1) Forward substitution is a generalization of copy propagation: the use of a variable is replaced by its defining expression, which must still be available (its operands unchanged) at that point. Substituting definitions into, for example, loop bounds and array subscripts can expose information needed by dependence analysis and other optimizations.
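A minimal sketch (hypothetical variables):

```c
/* Before: the loop bound is hidden behind the variable m. */
int m = n + 1;
for (int i = 0; i < m; i++)
    a[i] = b[i];

/* After forward substitution: the defining expression replaces the use,
 * making the bound explicit to later analyses. */
for (int i = 0; i < n + 1; i++)
    a[i] = b[i];
```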

Partial Evaluation (2) Strength Reduction: replace an expensive operator with an equivalent less expensive operator.

Redundancy Elimination These optimizations improve performance by identifying redundant computations and removing them. Redundancy-eliminating transformations remove two kinds of computations: those that are unreachable and those that are useless. A computation is unreachable if it is never executed; removing it from the program has no semantic effect on the instructions executed. A computation is useless if none of the outputs of the program depend on it.

Redundancy Elimination (1)
Unreachable Code Elimination
Useless Code Elimination
Dead Variable Elimination
Common Sub-expression Elimination
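A minimal sketch of common sub-expression elimination (hypothetical code); the other three eliminations simply delete code or variables that analysis proves are never executed or never used.

```c
/* Before: the address expression i * stride + base is computed twice. */
x = a[i * stride + base];
y = a[i * stride + base] + 1;

/* After common sub-expression elimination: the shared expression is
 * computed once and reused. */
int t = i * stride + base;
x = a[t];
y = a[t] + 1;
```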

Procedure Call Transformations These optimizations attempt to reduce the overhead of procedure calls in one of four ways:
eliminating the call entirely;
eliminating execution of the called procedure's body;
eliminating some of the entry/exit overhead; and
avoiding some steps in making a procedure call when the behavior of the called procedure is known or can be altered.

Procedure Call Transformations (1) Procedure Inlining replaces a procedure call with a copy of the body of the called procedure. When a call is inlined, all the overhead of the invocation is eliminated. After the call is inlined, the compiler may be able to prove loop independence, thereby allowing vectorization or parallelization. Inlining also affects the instruction cache behavior of the program.
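A minimal sketch (hypothetical scale function and caller):

```c
double scale(double x, double f) { return x * f; }

/* Before inlining: each iteration pays the call overhead, and the call
 * hides the loop body from dependence analysis. */
void apply(double *a, int n, double f) {
    for (int i = 0; i < n; i++)
        a[i] = scale(a[i], f);
}

/* After inlining: no call overhead, and the multiply is visible to
 * vectorization and further optimization. */
void apply_inlined(double *a, int n, double f) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] * f;
}
```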

Procedure Call Transformations (2)