Loop Optimizations Scheduling

Loop fusion

Before:

  int acc = 0;
  for (int i = 0; i < n; ++i) {
    acc += a[i];
    a[i] = acc;
  }
  for (int i = 0; i < n; ++i) {
    b[i] += a[i];
  }

After fusing the two loops:

  int acc = 0;
  for (int i = 0; i < n; ++i) {
    acc += a[i];
    a[i] = acc;   // Will DCE pick this up?
    b[i] += acc;
  }

Loop fission

Before:

  for (int i = 0; i < n; ++i) {
    a[i] = e1;
    b[i] = e2;   // e1 and e2 independent
  }

After splitting into two loops:

  for (int i = 0; i < n; ++i) {
    a[i] = e1;
  }
  for (int i = 0; i < n; ++i) {
    b[i] = e2;
  }

Loop unrolling

Before:

  for (int i = 0; i < n; ++i) {
    a[i] = b[i] * 7 + c[i] / 13;
  }

After unrolling by a factor of 3 (the first loop handles the n % 3 leftover iterations):

  int i = 0;
  for (; i < n % 3; ++i) {
    a[i] = b[i] * 7 + c[i] / 13;
  }
  for (; i < n; i += 3) {
    a[i]     = b[i]     * 7 + c[i]     / 13;
    a[i + 1] = b[i + 1] * 7 + c[i + 1] / 13;
    a[i + 2] = b[i + 2] * 7 + c[i + 2] / 13;
  }

Loop interchange

Before:

  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      a[i][j] += 1;
    }
  }

After interchanging the loops:

  for (int j = 0; j < n; ++j) {
    for (int i = 0; i < n; ++i) {
      a[i][j] += 1;
    }
  }

Loop peeling

Before:

  for (int i = 0; i < n; ++i) {
    b[i] = (i == 0) ? a[i] : a[i] + b[i-1];
  }

After peeling off the first iteration:

  b[0] = a[0];
  for (int i = 1; i < n; ++i) {
    b[i] = a[i] + b[i-1];
  }

Loop tiling

Before:

  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      for (int k = 0; k < n; ++k) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }

Very roughly (outer loops are still needed to move y and z across the matrix):

  for (int i = y; i < y + 10; ++i) {
    for (int j = z; j < z + 10; ++j) {
      for (int k = 0; k < n; ++k) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }
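A sketch of the complete transformation, assuming a tile size of 10 and that 10 divides n (both assumptions, not stated on the slide): the two outer loops step y and z from tile to tile, and the nest above runs over one tile at a time.

  // Tiled matrix multiply: i and j are blocked into 10x10 tiles so that the
  // touched rows of c and a stay in cache while a tile is being processed.
  for (int y = 0; y < n; y += 10) {
    for (int z = 0; z < n; z += 10) {
      for (int i = y; i < y + 10; ++i) {
        for (int j = z; j < z + 10; ++j) {
          for (int k = 0; k < n; ++k) {
            c[i][j] += a[i][k] * b[k][j];
          }
        }
      }
    }
  }

In practice the k loop is usually tiled as well, so that the blocks of a and b being multiplied also fit in cache.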

Loop parallelization

Before:

  for (int i = 0; i < n; ++i) {
    a[i] = b[i] + c[i];   // a, b, and c do not overlap
  }

After vectorizing by a factor of 4 (the first loop handles the n % 4 leftover iterations):

  int i = 0;
  for (; i < n % 4; ++i)
    a[i] = b[i] + c[i];
  for (; i < n; i = i + 4) {
    __some4SIMDadd(a + i, b + i, c + i);
  }
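__some4SIMDadd is left abstract on the slide. One possible body, assuming the arrays hold floats and the target supports SSE (both assumptions):

  #include <xmmintrin.h>   // SSE intrinsics

  // Hypothetical stand-in for the placeholder above: adds four consecutive
  // floats of b and c and stores the four results into a.
  static inline void some4SIMDadd(float *a, const float *b, const float *c) {
    _mm_storeu_ps(a, _mm_add_ps(_mm_loadu_ps(b), _mm_loadu_ps(c)));
  }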

Instruction scheduling

An instruction goes through the processor pipeline in one or more cycles.
Several instructions can be processed simultaneously, at different stages of the pipeline.
The number of cycles necessary to process an instruction is called its latency.
Examples of instruction latencies on some x86 processors:
 – ADD: 1 cycle
 – MUL: 4 cycles
 – DIV (32 bits): 40 cycles
The simplest form of scheduling is done per CFG block (local scheduling).

Instruction scheduling: example

Unscheduled (MUL has a 4-cycle latency, so the dependent ADD must wait):

  1: ADD R1, R2
  2: MUL R3, R4
  3: (stall)
  4: (stall)
  5: (stall)
  6: ADD R1, R3

Scheduled (moving the independent ADD after the MUL fills one stall cycle):

  1: MUL R3, R4
  2: ADD R1, R2
  3: (stall)
  4: (stall)
  5: ADD R1, R3

Beyond blocks: trace scheduling 1

Beyond blocks: trace scheduling 2
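A minimal source-level sketch of the idea, with illustrative names and an assumed branch bias: the scheduler picks the most frequently executed path through the CFG (the trace), schedules it as if it were a single basic block, and inserts compensation code on the off-trace paths for anything it moved across a branch or join.

  // Before: the branch is known (e.g. from profiling) to be usually taken.
  x = a[i];
  if (x != 0)        // the trace follows this path
    y = x * 3;
  else
    y = 0;
  b[i] = y + 1;      // below the join point

  // After trace scheduling: the store is moved up into the trace so it can be
  // scheduled together with the multiply; a duplicate is placed on the
  // off-trace path (compensation code) so behaviour is preserved.
  x = a[i];
  if (x != 0) {
    y = x * 3;
    b[i] = y + 1;    // moved into the trace
  } else {
    y = 0;
    b[i] = y + 1;    // compensation code
  }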

Pipelining for loops

First idea: unroll the loop and then schedule it
 – It works, but it is not always optimal, and it increases the code size
Think of a loop with the following body:
 – DIV R1, R3 ; ADD R1, R2
 – We would have to unroll 40 times to hide the DIV latency
 – And in general, it may not always be possible to hide the latency
 – What if the DIV were computing the value for 40 iterations from now?
A better idea: software pipelining

Software pipelining 1

"There is one last technique in the arsenal of the software optimizer that may be used to make most machines run at tip top speed. It can also lead to severe code bloat and may make for almost unreadable code, so should be considered the last refuge of the truly desperate. However, its performance characteristics are in many cases unmatched by any other approach, so we cover it here. It is called software pipelining [...]"
 — Apple Developer Connection

Software pipelining 2
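A minimal C sketch of the transformation, assuming a loop whose body contains a long-latency division followed by a dependent add (the array names and constants are illustrative): the division for iteration i+1 is started while the add of iteration i completes, so consecutive iterations overlap in the steady-state loop.

  // Original loop: each iteration waits for its own division.
  for (int i = 0; i < n; ++i) {
    int t = a[i] / 13;   // long latency
    b[i] = t + 7;
  }

  // Software-pipelined version: prologue, steady-state kernel, epilogue.
  if (n > 0) {
    int t = a[0] / 13;            // prologue: start iteration 0
    for (int i = 0; i < n - 1; ++i) {
      int next = a[i + 1] / 13;   // kernel: start the division for i+1 ...
      b[i] = t + 7;               // ... while finishing iteration i
      t = next;
    }
    b[n - 1] = t + 7;             // epilogue: finish the last iteration
  }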

Symbolic evaluation

Turning a sequence of instructions back into expressions
Hides some of the syntactic details
Example:

  add a b c ; add d a b ; add e d a

becomes

  a -> add(b, c)
  d -> add(add(b, c), b)
  e -> add(add(add(b, c), b), add(b, c))

For example, it is insensitive to the order of independent instructions
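A minimal sketch of how such trees can be built, assuming instructions are simple three-address adds over named registers (the representation is an assumption, not taken from the slides):

  #include <array>
  #include <map>
  #include <string>
  #include <vector>

  // Symbolically evaluate a straight-line block of "add dst src1 src2"
  // instructions: map each register to the expression tree (printed here as a
  // string) that produced its value. Registers never written stand for themselves.
  std::map<std::string, std::string>
  symEval(const std::vector<std::array<std::string, 3>>& block) {
    std::map<std::string, std::string> env;
    auto lookup = [&](const std::string& r) {
      auto it = env.find(r);
      return it != env.end() ? it->second : r;
    };
    for (const auto& ins : block)                      // ins = {dst, src1, src2}
      env[ins[0]] = "add(" + lookup(ins[1]) + "," + lookup(ins[2]) + ")";
    return env;
  }

  // symEval({{"a","b","c"}, {"d","a","b"}, {"e","d","a"}}) yields exactly the
  // trees above, and gives the same result if independent adds are reordered.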

Is my pipeline correct?

Let s(I) denote the symbolic evaluation of the block of instructions I
Let ∘ be the composition of symbolic trees
Reading P as the prologue, S as the steady-state body, E as the epilogue and B as the original loop body:
If s(P ∘ E) = s(B) and s(S ∘ E) = s(E ∘ B), then the pipeline is correct (the converse is not true)
Intuition: by induction, s(P ∘ S^n ∘ E) = s(P ∘ E ∘ B^n) = s(B^(n+1)), so the prologue, n kernel iterations and the epilogue compute the same values as n+1 iterations of the original loop