Compiler Support for Superscalar Processors

Loop Unrolling
Assumption: standard five-stage pipeline.
Empty cycles between an instruction and a dependent instruction, before the result can be used:
–FP-ALU to FP-ALU: 3
–FP-ALU to Store: 2
–Load to FP-ALU: 1
–Load to Store: 0
Jumps have one empty cycle.
Independent operations are important for efficient use of the pipeline.
Loop unrolling is a very important technique.

Example
for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s
Compiler output:
Loop: load  f0, 0(r1)     ; f0 = x[i]
      add   f4, f0, f2    ; x[i] + s
      store f4, 0(r1)     ; x[i] = f4
      addi  r1, r1, -8
      bne   r1, r2, Loop  ; branch if r1 != r2
Execution:
Loop: load  f0, 0(r1)     ; 1
      stall               ; 2
      add   f4, f0, f2    ; 3
      stall               ; 4
      stall               ; 5
      store f4, 0(r1)     ; 6
      addi  r1, r1, -8    ; 7
      stall               ; 8
      bne   r1, r2, Loop  ; 9
      stall               ; 10

Instruction Scheduling
Good instruction scheduling can reduce the execution time from 10 cycles to 6 cycles.
Loop: load  f0, 0(r1)     ; 1
      addi  r1, r1, -8    ; 2
      add   f4, f0, f2    ; 3
      stall               ; 4
      bne   r1, r2, Loop  ; 5
      store f4, 8(r1)     ; 6
The store now follows the addi, so its offset changes from 0 to 8 to address the same element.
Requires:
–Dependence analysis
–Symbolic optimization

Loop Unrolling
The real computation requires only three instructions: load, add, store.
The additional instructions for loop control are overhead.
Loop unrolling by a factor of k means:
–The loop body is replicated k times.
–Accesses to the loop variable have to be adapted.
–The loop control needs to be adapted.
–A post loop is generated if the number of iterations is not divisible by k.

Example
Advantages of loop unrolling:
–The ratio between useful instructions and overhead is improved.
–There are more operations available for instruction scheduling.
for (i=1000; i>0; i=i-4) {
    x[i]   = x[i]   + s
    x[i-1] = x[i-1] + s
    x[i-2] = x[i-2] + s
    x[i-3] = x[i-3] + s
}
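The unrolling steps (replicate the body, adapt the index accesses, adapt the loop control, add a post loop) can be sketched in Python. This is an illustration only: the function names are hypothetical, k is fixed at 4, and the index runs upward rather than downward from 1000 as on the slide.

```python
def add_scalar(x, s):
    """Reference loop: x[i] = x[i] + s for every element."""
    for i in range(len(x)):
        x[i] = x[i] + s


def add_scalar_unrolled(x, s):
    """Same computation, unrolled by k = 4 with a post loop."""
    n = len(x)
    main_end = n - n % 4              # largest multiple of 4 not above n
    for i in range(0, main_end, 4):   # loop control adapted: step is 4
        x[i] = x[i] + s               # four copies of the body,
        x[i + 1] = x[i + 1] + s       # each with an adapted index
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
    for i in range(main_end, n):      # post loop: the remaining 0 to 3 elements
        x[i] = x[i] + s
```

The post loop is what keeps the result correct when the element count is not divisible by 4; with 1000 elements, as on the slide, it would never execute.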

Reduction of overhead
Loop: load  f0, 0(r1)      ; x[i]
      add   f4, f0, f2
      store f4, 0(r1)
      load  f6, -8(r1)     ; x[i-1]
      add   f8, f6, f2
      store f8, -8(r1)
      load  f10, -16(r1)   ; x[i-2]
      add   f12, f10, f2
      store f12, -16(r1)
      load  f14, -24(r1)   ; x[i-3]
      add   f16, f14, f2
      store f16, -24(r1)
      addi  r1, r1, -32
      bne   r1, r2, Loop
Result: … cycles for 4 iterations. Before: 40 cycles for 4 iterations.

Optimized scheduling of instructions
Results in 3.5 cycles per iteration (6 before).
Loop: load  f0, 0(r1)      ; x[i]
      load  f6, -8(r1)     ; x[i-1]
      load  f10, -16(r1)   ; x[i-2]
      load  f14, -24(r1)   ; x[i-3]
      add   f4, f0, f2
      add   f8, f6, f2
      add   f12, f10, f2
      add   f16, f14, f2
      store f4, 0(r1)
      store f8, -8(r1)
      addi  r1, r1, -32
      store f12, 16(r1)
      bne   r1, r2, Loop
      store f16, 8(r1)
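The cycle counts quoted for these listings can be checked with a small in-order issue model. The sketch below is a simplification under stated assumptions, not a description of any real pipeline: `count_cycles`, `parse` and the table names are hypothetical, the one empty cycle between addi and bne is inferred from the stall shown in the execution listing, and the model only counts cycles (it does not check that the delay-slot instruction directly follows the branch).

```python
import re

# Empty cycles needed between a producer and a dependent consumer.
# The first four entries come from the slide; the integer-ALU-to-branch
# entry is an assumption inferred from the stall between addi and bne.
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("intalu", "branch"): 1,
}
KIND = {"load": "load", "add": "fpalu", "store": "store",
        "addi": "intalu", "bne": "branch"}


def parse(line):
    """Turn 'add f4,f0,f2' into (kind, destination, source registers)."""
    op, operands = line.split(None, 1)
    regs = re.findall(r"[fr]\d+", operands)
    if op in ("store", "bne"):           # these write no register
        return KIND[op], None, regs
    return KIND[op], regs[0], regs[1:]


def count_cycles(listing):
    """Cycles for one pass through the loop body, issued in order."""
    produced = {}                        # register -> (producer kind, issue cycle)
    cycle, kind = 0, None
    for line in listing.strip().splitlines():
        kind, dest, sources = parse(line.strip())
        issue = cycle + 1                # at most one instruction per cycle
        for reg in sources:
            if reg in produced:
                prod_kind, prod_cycle = produced[reg]
                gap = LATENCY.get((prod_kind, kind), 0)
                issue = max(issue, prod_cycle + gap + 1)
        if dest is not None:
            produced[dest] = (kind, issue)
        cycle = issue
    # A branch at the very end leaves its delay slot empty: one more cycle.
    return cycle + (1 if kind == "branch" else 0)


ORIGINAL = """
load f0, 0(r1)
add f4,f0,f2
store f4, 0(r1)
addi r1,r1,-8
bne r1,r2,Loop
"""
SCHEDULED = """
load f0, 0(r1)
addi r1,r1,-8
add f4,f0,f2
bne r1,r2,Loop
store f4, 8(r1)
"""
# Four load/add/store groups followed by the loop control.
UNROLLED = "".join(
    f"load f{a}, {off}(r1)\nadd f{b},f{a},f2\nstore f{b}, {off}(r1)\n"
    for a, b, off in [(0, 4, 0), (6, 8, -8), (10, 12, -16), (14, 16, -24)]
) + "addi r1,r1,-32\nbne r1,r2,Loop\n"
UNROLLED_SCHEDULED = """
load f0, 0(r1)
load f6, -8(r1)
load f10, -16(r1)
load f14, -24(r1)
add f4,f0,f2
add f8,f6,f2
add f12,f10,f2
add f16,f14,f2
store f4, 0(r1)
store f8, -8(r1)
addi r1,r1,-32
store f12, 16(r1)
bne r1,r2,Loop
store f16, 8(r1)
"""

for name in ("ORIGINAL", "SCHEDULED", "UNROLLED", "UNROLLED_SCHEDULED"):
    print(name, count_cycles(globals()[name]))
```

Under these assumptions the model gives 10 and 6 cycles for the single-iteration listings and 14 cycles for the scheduled unrolled loop, that is, 3.5 cycles per iteration. For the unrolled loop before scheduling it gives 28 cycles.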

Register Allocation
Using different registers allows reordering.
Unrolled copies that reuse f0 and f4:
Loop: load  f0, 0(r1)     ; x[i]
      add   f4, f0, f2
      store f4, 0(r1)
      load  f0, -8(r1)    ; x[i-1]
      add   f4, f0, f2
      store f4, -8(r1)
      …
Reordered with the same registers, stalls remain:
Loop: load  f0, 0(r1)     ; x[i]
      stall
      add   f4, f0, f2
      load  f0, -8(r1)    ; x[i-1]
      stall
      store f4, 0(r1)
      add   f4, f0, f2
      stall
      store f4, -8(r1)
      …

Register Allocation
The compiler starts with an unlimited number of virtual registers. These are then mapped to the registers of the ISA by graph coloring.
Live range of a register: the instructions in which a virtual register is live, i.e., from the definition of the register to its last access.
Creation of a graph:
–Nodes are virtual registers
–Edges are inserted if the live ranges overlap
Goal: color the nodes with a minimal number of colors so that neighboring nodes do not have the same color. The number of colors has to be less than or equal to the number of ISA registers.

Graph Coloring
Three registers are required. In addition, an index register.
Loop: load  v0, 0(r1)
      add   v4, v0, v2
      store v4, 0(r1)
      load  v6, -8(r1)
      add   v8, v6, v2
      store v8, -8(r1)
      load  v10, -16(r1)
      add   v12, v10, v2
      store v12, -16(r1)
      load  v14, -24(r1)
      add   v16, v14, v2
      store v16, -24(r1)
      addi  r1, r1, -32
      bne   r1, r2, Loop
Graph nodes: v0, v2, v4, v6, v8, v10, v12, v14, v16

Register Allocation after Instruction Scheduling
5 FP registers are required.
Loop: load  v0, 0(r1)
      load  v6, -8(r1)
      load  v10, -16(r1)
      load  v14, -24(r1)
      add   v4, v0, v2
      add   v8, v6, v2
      add   v12, v10, v2
      add   v16, v14, v2
      store v4, 0(r1)
      store v8, -8(r1)
      addi  r1, r1, -32
      store v12, 16(r1)
      bne   r1, r2, Loop
      store v16, 8(r1)
Graph nodes: v0, v4, v6, v8, v10, v12, v14, v16
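The graph-coloring procedure can be tried on both listings with a short Python sketch. Several things here are assumptions rather than part of the slides: all function names are hypothetical, live ranges are treated as inclusive intervals from definition to last access, and the coloring is a simple greedy pass in order of range start. That greedy pass is a stand-in for a real allocator; it finds a minimal coloring for straight-line interval ranges like these, but not for arbitrary graphs.

```python
def live_ranges(instrs, live_through=()):
    """Map each virtual register to (first, last) instruction index, inclusive."""
    ranges = {}
    for idx, (dest, sources) in enumerate(instrs):
        for reg in ([dest] if dest else []) + list(sources):
            first = ranges.get(reg, (idx, idx))[0]
            ranges[reg] = (first, idx)
    for reg in live_through:             # live across the whole loop body
        ranges[reg] = (0, len(instrs) - 1)
    return ranges


def interference_graph(ranges):
    """Nodes are virtual registers; an edge joins overlapping live ranges."""
    graph = {reg: set() for reg in ranges}
    regs = list(ranges)
    for i, a in enumerate(regs):
        for b in regs[i + 1:]:
            (a0, a1), (b0, b1) = ranges[a], ranges[b]
            if a0 <= b1 and b0 <= a1:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def greedy_coloring(graph, ranges):
    """Give each node the lowest color not used by an already colored neighbor."""
    color = {}
    for reg in sorted(ranges, key=ranges.get):
        used = {color[n] for n in graph[reg] if n in color}
        c = 0
        while c in used:
            c += 1
        color[reg] = c
    return color


def registers_needed(instrs, live_through=()):
    ranges = live_ranges(instrs, live_through)
    coloring = greedy_coloring(interference_graph(ranges), ranges)
    return len(set(coloring.values()))


# Each instruction is (destination, sources), restricted to the FP temporaries.
# v2 (the scalar s) is not listed as a source; it is passed via live_through.
PAIRS = [("v0", "v4"), ("v6", "v8"), ("v10", "v12"), ("v14", "v16")]
NOP = (None, [])                         # addi / bne touch no virtual FP register

# load, add, store per element, then addi and bne
UNSCHEDULED = [ins for ld, res in PAIRS
               for ins in ((ld, []), (res, [ld]), (None, [res]))] + [NOP, NOP]

# all loads, all adds, then the stores interleaved with addi and bne
SCHEDULED = ([(ld, []) for ld, _ in PAIRS]
             + [(res, [ld]) for ld, res in PAIRS]
             + [(None, ["v4"]), (None, ["v8"]), NOP,
                (None, ["v12"]), NOP, (None, ["v16"])])

print(registers_needed(UNSCHEDULED, live_through=("v2",)))
print(registers_needed(SCHEDULED))
```

With v2 kept live across the loop, the unscheduled listing needs three colors under this convention. For the scheduled listing, the temporaries alone (v2 left out, as in that slide's node list) need five, because hoisting all loads above the adds makes their live ranges overlap.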

Software Pipelining
[Figure: execution with loop unrolling (a) and software pipelining (b). Axis: number of overlapped operations. Labels: proportional to number of unrolls (a); start-up and wind-down (b).]

Software Pipelining
Loops are restructured so that each iteration of the new loop executes instructions taken from different iterations of the original loop.
[Figure: Iteration 0, Iteration 1, Iteration 2, Iteration 3, Iteration 4]

Example Software Pipelining
Original loop:
Loop: load  f0, 0(r1)
      add   f4, f0, f2
      store f4, 0(r1)
      addi  r1, r1, -8
      bne   r1, r2, Loop
Iteration i:    load f0, 0(r1)    add f4,f0,f2    store f4, 0(r1)
Iteration i-1:  load f0, 0(r1)    add f4,f0,f2    store f4, 0(r1)
Iteration i-2:  load f0, 0(r1)    add f4,f0,f2    store f4, 0(r1)
Pipelined loop:
Loop: store f4, 16(r1)    ; stores into M[i]
      add   f4, f0, f2    ; adds to M[i-1]
      load  f0, 0(r1)     ; loads M[i-2]
      addi  r1, r1, -8
      bne   r1, r2, Loop

Example: Software Pipelining
Start-up code and wind-down code have been omitted.
Register renaming is required to eliminate WAR conflicts.
The loop requires 5 cycles per iteration if instruction scheduling handles the addi and the jump as before.
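The omitted start-up and wind-down code can be made explicit in a Python sketch. This is an illustration under assumptions: the function name is hypothetical, the index runs upward instead of downward, and two local variables stand in for the registers f0 and f4.

```python
def add_scalar_pipelined(x, s):
    """Compute x[i] = x[i] + s with a software-pipelined loop body."""
    n = len(x)
    if n < 2:                        # too short to overlap anything
        for i in range(n):
            x[i] = x[i] + s
        return
    # Start-up: fill the pipeline.
    loaded = x[0]                    # load  for iteration 0
    summed = loaded + s              # add   for iteration 0
    loaded = x[1]                    # load  for iteration 1
    # Steady state: one store, one add and one load per pass,
    # each belonging to a different iteration of the original loop.
    for i in range(2, n):
        x[i - 2] = summed            # store for iteration i-2
        summed = loaded + s          # add   for iteration i-1
        loaded = x[i]                # load  for iteration i
    # Wind-down: drain the pipeline.
    x[n - 2] = summed                # store for iteration n-2
    summed = loaded + s              # add   for iteration n-1
    x[n - 1] = summed                # store for iteration n-1
```

Inside the steady-state loop, no statement depends on a value produced in the same pass, which is the property the pipelined assembly loop exploits.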

Software Pipelining vs Loop Unrolling
Software pipelining is symbolic loop unrolling; its algorithms are based on loop unrolling.
Advantages of software pipelining:
–Results in shorter code, especially for long latencies.
–Confines the region of low overlap to the start-up and wind-down code.
Advantage of loop unrolling:
–Reduces loop overhead.
Advantage of both techniques:
–They use independent operations from different loop iterations.
Best results are obtained by combining both techniques.

Loop fusion
Loop fusion combines consecutive loops that have the same loop control. The instructions might then be executed more efficiently. Loop fusion is not always possible.
Before:
do i=1,n
  a(i)= b(i)+2
enddo
do i=1,n
  c(i)= d(i+1) * a(i)
enddo
After:
do i=1,n
  a(i)= b(i)+2
  c(i)= d(i+1) * a(i)
enddo

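Example: Incorrect Loop Fusion
Before:
do i=1,n
  S1: a(i)= b(i)+2
enddo
do i=1,n
  S2: c(i)= d(i+1) * a(i+1)
enddo
After:
do i=1,n
  S1: a(i)= b(i)+2
  S2: c(i)= d(i+1) * a(i+1)
enddo
In the fused loop, S2 reads a(i+1) before S1 has written it, so it sees the old value.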

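Example: Correct Loop Fusion
Before:
do i=1,n
  S1: a(i)= b(i)+2
enddo
do i=1,n
  S2: c(i)= d(i+1) * a(i-1)
enddo
After:
do i=1,n
  S1: a(i)= b(i)+2
  S2: c(i)= d(i+1) * a(i-1)
enddo
Here S2 reads a(i-1), which S1 has already written in the previous iteration, so the fused loop computes the same values.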
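The difference between the correct and the incorrect fusion can be observed by running both forms on the same data. The following Python sketch is illustrative only: the function names and the sample arrays are made up, and the arrays carry one extra element at each end so that the indices i-1 and i+1 stay in bounds for i = 1..n.

```python
def run_separate(a, b, d, n, shift):
    """Two loops: S1 for all i, then S2 for all i. S2 reads a[i + shift]."""
    a, c = list(a), [0] * (n + 2)
    for i in range(1, n + 1):
        a[i] = b[i] + 2                   # S1
    for i in range(1, n + 1):
        c[i] = d[i + 1] * a[i + shift]    # S2
    return a, c


def run_fused(a, b, d, n, shift):
    """One loop that executes S1 and S2 back to back in each iteration."""
    a, c = list(a), [0] * (n + 2)
    for i in range(1, n + 1):
        a[i] = b[i] + 2                   # S1
        c[i] = d[i + 1] * a[i + shift]    # S2
    return a, c


# Hypothetical sample data; index 0 and index N + 1 are padding.
N = 4
A0 = [1, 1, 1, 1, 1, 1]
B = [0, 10, 20, 30, 40, 0]
D = [0, 1, 2, 3, 4, 5]

for shift in (0, -1, 1):
    same = run_separate(A0, B, D, N, shift) == run_fused(A0, B, D, N, shift)
    print(f"a(i{shift:+d}): fused loop matches separate loops = {same}")
```

With shift 0 and shift -1 the fused loop reproduces the separate loops. With shift +1 it does not, because S2 then reads an element of a that S1 only overwrites in the next iteration.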

Advantages of Transformations
The transformations increase the number of independent instructions. These can be scheduled and executed more efficiently.

Disadvantages of the Transformations
Transformations increase register pressure.
They increase the size of the code, which might lead to less efficient use of the memory hierarchy.
Transformations can also reduce data locality.

Summary of Transformations
The compiler has a global overview.
Goal: more operations for instruction scheduling.
The compiler also supports efficient execution in other areas.