1 COMP 740: Computer Architecture and Implementation
Montek Singh
Tue, Feb 24, 2009
Topic: Instruction-Level Parallelism IV (Software Approaches / Compiler Techniques)

2 Outline
- Motivation
- Compiler scheduling
  - Loop unrolling
  - Software pipelining

3 Review: Instruction-Level Parallelism (ILP)
- Pipelining is most effective when there is parallelism among instructions
  - instructions u and v are parallel if neither is dependent on the other
- Problem: parallelism within a basic block is limited
  - a branch frequency of 15% implies about 6 instructions per basic block
  - these instructions are likely to depend on each other
  - → need to look beyond basic blocks
- Solution: exploit loop-level parallelism, i.e., parallelism across loop iterations
  - to convert loop-level parallelism into ILP, we need to "unroll" the loop
    - dynamically, by the hardware
    - statically, by the compiler
    - using vector instructions: same op applied to all vector elements

4 Motivating Example for Loop Unrolling

    for (i = 1000; i > 0; i--)
        x[i] = x[i] + s;

Assumptions:
- Scalar s is in register F2
- Array x starts at memory address 0
- 1-cycle branch delay
- No structural hazards
- (The usual textbook FP latencies are assumed here, consistent with the stall counts on later slides: 1 stall between an L.D and a dependent FP op, 2 stalls between an FP op and a dependent store, 1 stall between an integer op and a dependent branch)

    LOOP:  L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     0(R1), F4
           DADDUI  R1, R1, -8
           BNEZ    R1, LOOP
           NOP

10 cycles per iteration

5 How Far Can We Get With Scheduling?

Original:

    LOOP:  L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     0(R1), F4
           DADDUI  R1, R1, -8
           BNEZ    R1, LOOP
           NOP

Scheduled:

    LOOP:  L.D     F0, 0(R1)
           DADDUI  R1, R1, -8
           ADD.D   F4, F0, F2
           nop
           BNEZ    R1, LOOP
           S.D     8(R1), F4

6 cycles per iteration
Note the change in the S.D instruction, from 0(R1) to 8(R1); this is a non-trivial change!

6 Observations on Scheduled Code
- 3 out of 5 instructions involve FP work
- The other two constitute loop overhead
- Could we improve performance by unrolling the loop?
  - assume the number of loop iterations is a multiple of 4, and unroll the loop body four times
  - in real life, we must also handle loop counts that are not multiples of 4 (see the sketch below)
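A minimal C sketch (not from the original slides) of how a trip count that is not a multiple of 4 can be handled: peel off the leftover iterations first, then run the body unrolled four times. The function name add_scalar and the parameter n are illustrative assumptions.

    /* x[1..n] += s, with the main loop unrolled by 4 */
    void add_scalar(double *x, double s, int n)
    {
        int i = n;

        /* remainder loop: execute (n mod 4) iterations one at a time */
        while (i % 4 != 0) {
            x[i] = x[i] + s;
            i--;
        }

        /* main loop: i is now a multiple of 4, so the body can be unrolled 4x */
        for (; i > 0; i -= 4) {
            x[i]     = x[i]     + s;
            x[i - 1] = x[i - 1] + s;
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
    }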

7 Unrolling: Take 1

    LOOP:  L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     0(R1), F4
           DADDUI  R1, R1, -8
           L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     0(R1), F4
           DADDUI  R1, R1, -8
           L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     0(R1), F4
           DADDUI  R1, R1, -8
           L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     0(R1), F4
           DADDUI  R1, R1, -8
           BNEZ    R1, LOOP
           NOP

- Even though we have gotten rid of the control dependences, we still have data dependences through R1
- We could remove those data dependences by observing that R1 is decremented by 8 each time:
  - Adjust the address specifiers
  - Delete the first three DADDUIs
  - Change the constant in the fourth DADDUI to -32
- These are non-trivial inferences for a compiler to make

8 Unrolling: Take 2

    LOOP:  L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     0(R1), F4
           L.D     F0, -8(R1)
           ADD.D   F4, F0, F2
           S.D     -8(R1), F4
           L.D     F0, -16(R1)
           ADD.D   F4, F0, F2
           S.D     -16(R1), F4
           L.D     F0, -24(R1)
           ADD.D   F4, F0, F2
           S.D     -24(R1), F4
           DADDUI  R1, R1, -32
           BNEZ    R1, LOOP
           NOP

- Performance is now limited by the WAR dependences on F0
- These are name dependences:
  - The instructions are not in a producer-consumer relation
  - They are simply using the same registers, but they don't have to
  - We can use different registers in different loop iterations, subject to availability
- Let's rename registers

9 Unrolling: Take 3

    LOOP:  L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     0(R1), F4
           L.D     F6, -8(R1)
           ADD.D   F8, F6, F2
           S.D     -8(R1), F8
           L.D     F10, -16(R1)
           ADD.D   F12, F10, F2
           S.D     -16(R1), F12
           L.D     F14, -24(R1)
           ADD.D   F16, F14, F2
           S.D     -24(R1), F16
           DADDUI  R1, R1, -32
           BNEZ    R1, LOOP
           NOP

- Time for execution of 4 iterations:
  - 14 instruction cycles
  - 4 L.D → ADD.D stalls
  - 8 ADD.D → S.D stalls
  - 1 DADDUI → BNEZ stall
  - 1 branch delay stall (the NOP)
- 28 cycles for 4 iterations, or 7 cycles per iteration
- Slower than the scheduled version of the original loop, which needed 6 cycles per iteration
- Let's schedule the unrolled loop

10 Unrolling: Take 4

    LOOP:  L.D     F0, 0(R1)
           L.D     F6, -8(R1)
           L.D     F10, -16(R1)
           L.D     F14, -24(R1)
           ADD.D   F4, F0, F2
           ADD.D   F8, F6, F2
           ADD.D   F12, F10, F2
           ADD.D   F16, F14, F2
           S.D     0(R1), F4
           S.D     -8(R1), F8
           DADDUI  R1, R1, -32
           S.D     16(R1), F12
           BNEZ    R1, LOOP
           S.D     8(R1), F16

- This code runs without stalls:
  - 14 cycles for 4 iterations
  - 3.5 cycles per iteration
  - loop control overhead is incurred only once every four iterations
- Note that the original loop had three FP instructions that were not independent
- Loop unrolling exposed independent instructions from multiple loop iterations
- By unrolling further, we can approach the asymptotic rate of 3 cycles per iteration, subject to the availability of registers

11 What Did The Compiler Have To Do?
- Determine that it was legal to move the S.D...
  - ...after the DADDUI and BNEZ
  - ...and find the amount by which to adjust the S.D offset
- Determine that loop unrolling would be useful...
  - ...by discovering the independence of the loop iterations
- Rename registers to avoid name dependences
- Eliminate the extra tests and branches and adjust the loop control
- Determine that the L.D's and S.D's can be interchanged...
  - ...by determining that (since R1 is not being updated) the address specifiers 0(R1), -8(R1), -16(R1), -24(R1) all refer to different memory locations
- Schedule the code, preserving dependences

12 Limits to Gain from Loop Unrolling
- The benefit from reduced loop overhead tapers off
  - The amount of overhead amortized diminishes with successive unrolls
- Code size limitations
  - For larger loops, code size growth is a concern
    - Especially for embedded processors with limited memory
  - The instruction cache miss rate increases
- Architectural/compiler limitations
  - Register pressure
    - Need many registers to exploit ILP
    - Especially challenging in multiple-issue architectures

13 Dependences
- Three kinds of dependences:
  - Data dependence
  - Name dependence
  - Control dependence
- In the context of loop-level parallelism, a data dependence can be:
  - Loop-independent
  - Loop-carried
- Data dependences act as a limit on how much ILP can be exploited in a compiled program
  - The compiler tries to identify and eliminate dependences
  - The hardware tries to prevent dependences from becoming stalls

14 Control Dependences
- A control dependence determines the ordering of an instruction with respect to a branch instruction, so that the non-branch instruction is executed only when it should be

    if (p1) {s1;}
    if (p2) {s2;}

- Control dependence constrains code motion:
  - An instruction that is control dependent on a branch cannot be moved before the branch in a way that makes its execution no longer controlled by the branch
  - An instruction that is not control dependent on a branch cannot be moved after the branch in a way that puts its execution under the control of the branch
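A small C illustration (not from the slides) of why the first constraint matters: the dereference below is control dependent on the pointer test, so hoisting it above the branch could fault.

    /* hypothetical example: p may be NULL */
    if (p != NULL) {      /* the branch */
        x = *p;           /* control dependent on the branch; moving this above
                             the if would execute it even when p == NULL */
    }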

15 Data Dependence in Loop Iterations

First loop body, with two consecutive iterations written out (each statement depends on values produced in the previous iteration, i.e., the dependences are loop-carried):

    A[u+1] = A[u]   + C[u];
    B[u+1] = B[u]   + A[u+1];
    A[u+2] = A[u+1] + C[u+1];
    B[u+2] = B[u+1] + A[u+2];

Second loop body, with two consecutive iterations written out (the only loop-carried dependence is through B):

    A[u]   = A[u]   + B[u];
    B[u+1] = C[u]   + D[u];
    A[u+1] = A[u+1] + B[u+1];
    B[u+2] = C[u+1] + D[u+1];

The same second loop with its statements reordered, so that the dependence from B[u+1] to its use falls within a single iteration:

    B[u+1] = C[u]   + D[u];
    A[u+1] = A[u+1] + B[u+1];
    B[u+2] = C[u+1] + D[u+1];
    A[u+2] = A[u+2] + B[u+2];
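For reference, the two loop bodies above correspond to loops of roughly the following form (a sketch; the bounds 1..100 are illustrative, and the arrays A, B, C, D are assumed declared elsewhere):

    /* First loop: A[i+1] and B[i+1] depend on values computed in the previous
       iteration, so the loop-carried dependences force the iterations to run
       in order. */
    for (i = 1; i <= 100; i++) {
        A[i+1] = A[i] + C[i];
        B[i+1] = B[i] + A[i+1];
    }

    /* Second loop: the only loop-carried dependence is from B[i+1] to the use
       of B[i] in the next iteration; as discussed on the next slide, it can be
       transformed so that the iterations become independent. */
    for (i = 1; i <= 100; i++) {
        A[i]   = A[i] + B[i];
        B[i+1] = C[i] + D[i];
    }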

16 Loop Transformation
- Sometimes a loop-carried dependence does not prevent loop parallelization
  - Example: the second loop of the previous slide (see the transformed version sketched below)
- In other cases, a loop-carried dependence prohibits loop parallelization
  - Example: the first loop of the previous slide

Second loop body and several consecutive iterations:

    A[u]   = A[u]   + B[u];
    B[u+1] = C[u]   + D[u];
    A[u+1] = A[u+1] + B[u+1];
    B[u+2] = C[u+1] + D[u+1];
    A[u+2] = A[u+2] + B[u+2];
    B[u+3] = C[u+2] + D[u+2];
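A sketch of the standard transformation for the second loop (assuming it originally ran for i = 1..100, as in the earlier sketch): peel the first A[] update and the last B[] update out of the loop, after which the remaining dependence is loop-independent and the iterations can be parallelized.

    A[1] = A[1] + B[1];                /* peeled: first iteration's A update    */
    for (i = 1; i <= 99; i++) {
        B[i+1] = C[i]   + D[i];        /* produce B[i+1] ...                    */
        A[i+1] = A[i+1] + B[i+1];      /* ... and consume it in the same
                                          iteration: no loop-carried dependence */
    }
    B[101] = C[100] + D[100];          /* peeled: last iteration's B update     */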

17 Software Pipelining
- Observation: if loop iterations are independent, then we can get ILP by taking instructions from different iterations
- Software pipelining reorganizes a loop so that each iteration of the new loop is made from instructions chosen from different iterations of the original loop

[Figure: a software-pipelined iteration draws its instructions from several consecutive original iterations, i0 through i4]

18 Software Pipelining Example

Before: unrolled 3 times (cf. the unrolling on the earlier slides)

    L.D     F0, 0(R1)
    ADD.D   F4, F0, F2
    S.D     0(R1), F4
    L.D     F0, -8(R1)
    ADD.D   F4, F0, F2
    S.D     -8(R1), F4
    L.D     F0, -16(R1)
    ADD.D   F4, F0, F2
    S.D     -16(R1), F4
    DADDUI  R1, R1, -24
    BNEZ    R1, LOOP

After: software pipelined

          L.D     F0, 0(R1)        ; prologue
          ADD.D   F4, F0, F2
          L.D     F0, -8(R1)
    LOOP: S.D     0(R1), F4        ; stores into M[i]
          ADD.D   F4, F0, F2       ; adds to M[i-1]
          L.D     F0, -16(R1)      ; loads M[i-2]
          DADDUI  R1, R1, -8
          BNEZ    R1, LOOP
          S.D     0(R1), F4        ; epilogue
          ADD.D   F4, F0, F2
          S.D     -8(R1), F4

[Figure: pipeline diagram (IF ID EX MEM WB) showing how the S.D, ADD.D, and L.D of one software-pipelined iteration overlap, and how the reads and writes of F4 and F0 interleave]

19 Software Pipelining: Concept

    Loop:  L(i)  E(i)  S(i)
           B Loop

    which executes as:  L(1) E(1) S(1)  B Loop
                        L(2) E(2) S(2)  B Loop
                        L(3) E(3) S(3)  B Loop
                        ...
                        L(n) E(n) S(n)

- Notation: Load, Execute, Store
- Iterations are independent
- In the normal sequence, E(i) depends on L(i), and S(i) depends on E(i), leading to pipeline stalls
- Software pipelining attempts to reduce these delays by inserting other instructions between such dependent pairs and "hiding" the delay
  - the "other" instructions are L and S instructions from other loop iterations
- It does this without consuming extra code space or registers
  - performance is usually not as high as that of loop unrolling
- How can we permute L, E, S to achieve this?

Reference: "A Study of Scalar Compilation Techniques for Pipelined Supercomputers", S. Weiss and J. E. Smith, ISCA 1987.
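A hedged C-level sketch of the idea, using the x[i] = x[i] + s loop from the earlier slides and one of the permutations shown on the next slide ("L(1); Loop: E(i) S(i) L(i+1); E(n) S(n)", here counting downward like the original loop and assuming n >= 1): the load for the next iteration is issued inside the current iteration, so the load-to-use delay is hidden behind the store.

    int i;
    double t = x[n];              /* prologue: load for the first iteration (L)  */
    for (i = n; i > 1; i--) {
        x[i] = t + s;             /* E(i) and S(i) for the current iteration     */
        t = x[i - 1];             /* L for the next iteration, issued early      */
    }
    x[1] = t + s;                 /* epilogue: E and S for the last iteration    */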

20 An Abstract View of Software Pipelining

Original loop:

    Loop:   L(i)  E(i)  S(i)
            B Loop

Some ways to permute L, E, S across iterations (prologue instructions before the loop, epilogue instructions after it):

    (a)     L(1)
        Loop:   E(i)  S(i)  L(i+1)
                B Loop
                E(n)  S(n)

    (b)         J Entry
        Loop:   S(i-1)
        Entry:  L(i)  E(i)
                B Loop
                S(n)

    (c)     L(1)
                J Entry
        Loop:   S(i-1)
        Entry:  E(i)  L(i+1)
                B Loop
                S(n-1)  E(n)  S(n)

    (d)     L(1)
        Loop:   E(i)  L(i+1)  S(i)
                B Loop
                E(n)  S(n)

    (e)     L(1)
                J Entry
        Loop:   L(i)  S(i-1)
        Entry:  E(i)
                B Loop
                S(n)

Some of these permutations maintain the original ordering of loads and stores, while others change it.

21 Other Compiler Techniques
- Static branch prediction. Examples:
  - predict always taken
  - predict never taken
  - predict: forward branches never taken, backward branches always taken

       LD     R1, 0(R2)
       DSUBU  R1, R1, R3
       BEQZ   R1, L
       OR     R4, R5, R6
       DADDU  R10, R4, R3
    L: DADDU  R7, R8, R9

- A stall is needed after the LD (the DSUBU uses R1 immediately); static prediction can help fill it:
  - if the branch is almost always taken, and R7 is not needed on the fall-through path, move DADDU R7, R8, R9 to right after the LD
  - if the branch is almost never taken, and R4 is not needed on the taken path, move the OR instruction to right after the LD

22 Very Long Instruction Word (VLIW)
- VLIW: the compiler schedules multiple instructions per issue
  - The long instruction word has room for many operations
  - By definition, all the operations the compiler puts in the long instruction word can execute in parallel
  - E.g., 2 integer operations, 2 FP operations, 2 memory references, 1 branch
    - 16 to 24 bits per field => 7 × 16 = 112 bits to 7 × 24 = 168 bits wide
- Needs a very sophisticated compilation technique...
  - ...one that schedules across several branches
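A rough C sketch (purely illustrative, not an actual ISA encoding) of what the 7-slot, 24-bits-per-field long instruction word described above might look like:

    /* One VLIW "instruction": 7 independent operation slots, 24 bits each
       (7 x 24 = 168 bits), all issued in the same clock cycle. */
    struct vliw_word {
        unsigned int mem_op1   : 24;   /* memory reference slot 1  */
        unsigned int mem_op2   : 24;   /* memory reference slot 2  */
        unsigned int fp_op1    : 24;   /* FP operation slot 1      */
        unsigned int fp_op2    : 24;   /* FP operation slot 2      */
        unsigned int int_op1   : 24;   /* integer operation slot 1 */
        unsigned int int_op2   : 24;   /* integer operation slot 2 */
        unsigned int branch_op : 24;   /* branch slot              */
    };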

23 Loop Unrolling in VLIW

    Memory ref 1       Memory ref 2       FP op 1            FP op 2            Int op / branch    Clock
    LD  F0, 0(R1)      LD  F6, -8(R1)                                                                1
    LD  F10, -16(R1)   LD  F14, -24(R1)                                                              2
    LD  F18, -32(R1)   LD  F22, -40(R1)   ADDD F4, F0, F2    ADDD F8, F6, F2                         3
    LD  F26, -48(R1)                      ADDD F12, F10, F2  ADDD F16, F14, F2                       4
                                          ADDD F20, F18, F2  ADDD F24, F22, F2                       5
    SD  0(R1), F4      SD  -8(R1), F8     ADDD F28, F26, F2                                          6
    SD  -16(R1), F12   SD  -24(R1), F16                                                              7
    SD  -32(R1), F20   SD  -40(R1), F24                                         SUBI R1, R1, #48     8
    SD  -0(R1), F28                                                             BNEZ R1, LOOP        9

- Unrolled 7 times to avoid delays
- 7 results in 9 clocks, or 1.3 clocks per iteration (down from 6)
- Need more registers in VLIW