1 Lecture 6: Static ILP Topics: loop analysis, SW pipelining, predication, speculation (Section 2.2, Appendix G)

2 Loop Dependences
If a loop only has dependences within an iteration, the loop is considered parallel: multiple iterations can be executed together as long as the order within an iteration is preserved
If a loop has dependences across iterations, it is not parallel, and these dependences are referred to as "loop-carried"
Not all loop-carried dependences imply a lack of parallelism

3 Examples

for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;

for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}

for (i=1000; i>0; i=i-1)
    x[i] = x[i-3] + s;         /* S1 */

4 Examples

for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;
No loop-carried dependences

for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}
S2 depends on S1 in the same iteration; S1 depends on S1 from the previous iteration; S2 depends on S2 from the previous iteration

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}
S1 depends on S2 from the previous iteration

for (i=1000; i>0; i=i-1)
    x[i] = x[i-3] + s;         /* S1 */
S1 depends on S1 from 3 iterations earlier; a statement depending on itself across iterations is referred to as a recurrence; the dependence distance is 3, so parallelism is limited
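To see why a dependence distance of 3 still permits some parallelism: any three consecutive iterations of the last loop read and write disjoint elements, so they can issue together. A minimal C sketch of exploiting this (the function name is an assumption, and the loop bounds are adjusted from the slide's example so that all indices stay in range):

    /* Exploits dependence distance 3: iterations i, i-1, i-2 touch
       disjoint elements and are mutually independent, so a wide machine
       could issue all three statements in the same cycle.
       Assumes x[0..n] are valid and n >= 3. */
    void recurrence3(double *x, int n, double s) {
        int i = n;
        for (; i >= 5; i -= 3) {
            /* the three statements below are independent of each other */
            x[i]   = x[i-3] + s;
            x[i-1] = x[i-4] + s;
            x[i-2] = x[i-5] + s;
        }
        for (; i >= 3; i--)    /* serial cleanup for leftover iterations */
            x[i] = x[i-3] + s;
    }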

5 Constructing Parallel Loops
If loop-carried dependences are not cyclic (S1 depending on S1 is cyclic), loops can be restructured to be parallel

Original loop:
for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}
S1 depends on S2 from the previous iteration

Restructured loop:
A[1] = A[1] + B[1];
for (i=1; i<=99; i=i+1) {
    B[i+1] = C[i] + D[i];          /* S3 */
    A[i+1] = A[i+1] + B[i+1];      /* S4 */
}
B[101] = C[100] + D[100];
S4 depends on S3 of the same iteration, so the dependence is no longer loop-carried
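The two versions can be checked against each other directly. A minimal sketch (not from the slides; the function names, array sizes, and test harness are assumptions):

    #include <assert.h>

    #define N 100

    /* Original loop: S1 has a loop-carried dependence on S2. */
    void original(double A[N+2], double B[N+2],
                  const double C[N+2], const double D[N+2]) {
        for (int i = 1; i <= N; i++) {
            A[i] = A[i] + B[i];        /* S1 */
            B[i+1] = C[i] + D[i];      /* S2 */
        }
    }

    /* Restructured loop: the only dependence is within an iteration. */
    void restructured(double A[N+2], double B[N+2],
                      const double C[N+2], const double D[N+2]) {
        A[1] = A[1] + B[1];            /* peeled first S1 */
        for (int i = 1; i <= N-1; i++) {
            B[i+1] = C[i] + D[i];          /* S3 */
            A[i+1] = A[i+1] + B[i+1];      /* S4 */
        }
        B[N+1] = C[N] + D[N];          /* peeled last S2 */
    }

    int main(void) {
        double A1[N+2], B1[N+2], A2[N+2], B2[N+2], C[N+2], D[N+2];
        for (int i = 0; i < N+2; i++) {
            A1[i] = A2[i] = i;  B1[i] = B2[i] = 2*i;
            C[i] = 3*i;  D[i] = i + 1;
        }
        original(A1, B1, C, D);
        restructured(A2, B2, C, D);
        for (int i = 0; i < N+2; i++)   /* both versions must agree */
            assert(A1[i] == A2[i] && B1[i] == B2[i]);
        return 0;
    }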

6 Finding Dependences – the GCD Test
Do A[ai + b] and A[ci + d] refer to the same element?
Restrict ourselves to affine array indices (expressible as ai + b, where i is the loop index and a and b are constants); an example of a non-affine index is x[y[i]]
For a dependence to exist, there must be two indices j and k, both within the loop bounds, such that aj + b = ck + d, i.e., aj - ck = d - b
Let G = GCD(a,c); dividing through gives (a/G)j - (c/G)k = (d-b)/G, and the left-hand side is an integer since G divides both a and c
So if (d-b)/G is not an integer, the initial equality cannot hold and no dependence exists
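A minimal C sketch of the test (the function names are assumptions; note the test is conservative, since it ignores the loop bounds and can therefore report a dependence that the bounds actually rule out):

    #include <stdio.h>

    /* Euclid's algorithm; assumes at least one of x, y is nonzero. */
    static int gcd(int x, int y) {
        while (y != 0) { int t = y; y = x % y; x = t; }
        return x < 0 ? -x : x;
    }

    /* GCD test for accesses A[a*i + b] and A[c*i + d]: a dependence is
       possible only if GCD(a,c) divides (d - b). */
    int gcd_test_may_depend(int a, int b, int c, int d) {
        return (d - b) % gcd(a, c) == 0;
    }

    int main(void) {
        /* A[2i] vs A[2i+1]: GCD(2,2)=2 does not divide 1 -> no dependence */
        printf("%d\n", gcd_test_may_depend(2, 0, 2, 1));   /* prints 0 */
        /* A[4i+1] vs A[6i+3]: GCD(4,6)=2 divides 2 -> dependence possible
           (e.g., j=2, k=1 gives index 9 in both) */
        printf("%d\n", gcd_test_may_depend(4, 1, 6, 3));   /* prints 1 */
        return 0;
    }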

7 Software Pipeline?!
[Figure: successive iterations of the loop below overlapped in time – in any given cycle, one iteration performs its S.D while a later iteration performs its ADD.D and a still later one its L.D, so the L.D/ADD.D/S.D stages stay full across iterations]

Loop:  L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     F4, 0(R1)
       DADDUI  R1, R1, #-8
       BNE     R1, R2, Loop

8 Software Pipelining

Original loop:
Loop:  L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     F4, 0(R1)
       DADDUI  R1, R1, #-8
       BNE     R1, R2, Loop

Software-pipelined loop:
Loop:  S.D     F4, 16(R1)
       ADD.D   F4, F0, F2
       L.D     F0, 0(R1)
       DADDUI  R1, R1, #-8
       BNE     R1, R2, Loop

Advantages: achieves nearly the same effect as loop unrolling, but without the code expansion; an unrolled loop may have inefficiencies at the start and end of each iteration, while a sw-pipelined loop is almost always in steady state; a sw-pipelined loop can also be unrolled to reduce loop overhead
Disadvantages: does not reduce loop overhead, and may require more registers
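To make the rotation concrete, here is a C-level sketch of the same schedule (the function name is an assumption; the prologue and epilogue, which the slide omits, are what start up and drain the pipeline; assumes n >= 2):

    /* Each loop trip stores the result of iteration i, adds for
       iteration i-1, and loads for iteration i-2, mirroring the
       S.D / ADD.D / L.D rotation in the assembly above. */
    void add_scalar_swp(double *x, int n, double s) {
        double f0, f4;                /* play the roles of F0 and F4 */
        f0 = x[n-1];                  /* prologue: L.D for iteration 1   */
        f4 = f0 + s;                  /*           ADD.D for iteration 1 */
        f0 = x[n-2];                  /*           L.D for iteration 2   */
        for (int i = n-1; i >= 2; i--) {
            x[i] = f4;                /* S.D: store result of iteration i */
            f4 = f0 + s;              /* ADD.D: add for iteration i-1     */
            f0 = x[i-2];              /* L.D: load for iteration i-2      */
        }
        x[1] = f4;                    /* epilogue: drain the last results */
        x[0] = f0 + s;
    }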

9 Predication
A branch within a loop can be problematic to schedule
Control dependences are a problem because of the need to re-fetch on a mispredict
For short loop bodies, control dependences can be converted to data dependences by using predicated/conditional instructions

10 Predicated or Conditional Instructions
The instruction has an additional operand that determines whether the instruction completes or gets converted into a no-op
Example: lwc R1, 0(R2), R3 (load-word-conditional) will load the word at address (R2) into R1 if R3 is non-zero; if R3 is zero, the instruction becomes a no-op
Replaces a control dependence with a data dependence (the branches disappear); may need register copies for the condition or for values used by both directions

Original code:
if (R1 == 0)
    R2 = R2 + R4
else {
    R6 = R3 + R5
    R4 = R2 + R3
}

Predicated code:
R7 = !R1
R8 = R2
R2 = R2 + R4    (predicated on R7)
R6 = R3 + R5    (predicated on R1)
R4 = R8 + R3    (predicated on R1)
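In C terms, if-conversion turns the branch into unconditional computation plus selects. A minimal sketch (the function names are assumptions; the int array R[1..8] mimics the slide's registers, and a compiler would map the selects to conditional-move or predicated instructions):

    /* Branchy original: only one side of the if executes.
       Caller supplies int R[9]. */
    void branchy(int *R) {
        if (R[1] == 0)
            R[2] = R[2] + R[4];
        else {
            R[6] = R[3] + R[5];
            R[4] = R[2] + R[3];
        }
    }

    /* If-converted: both sides are computed; the predicate selects what
       commits. R[8] copies the old R[2] so the two sides can be
       scheduled in any order (or in parallel) without a WAR hazard. */
    void predicated(int *R) {
        int p = (R[1] == 0);                    /* R7 = !R1 */
        R[8] = R[2];                            /* R8 = R2  */
        R[2] = p ? R[2] + R[4] : R[2];          /* predicated on R7 */
        R[6] = p ? R[6] : R[3] + R[5];          /* predicated on R1 */
        R[4] = p ? R[4] : R[8] + R[3];          /* predicated on R1 */
    }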

11 Complications
Each instruction has one more input operand, requiring more register ports/bypassing
If the branch condition is not known, the instruction stalls (remember, these are in-order processors)
Some implementations allow the instruction to continue without the branch condition and squash/complete later in the pipeline – wasted work
Increases register pressure and activity on the functional units
Does not help if the branch condition takes a while to evaluate
