Lecture 11: Advanced Static ILP


Lecture 11: Advanced Static ILP
Topics: loop unrolling, software pipelining (Section 4.4)

Loop Dependences
- If a loop only has dependences within an iteration, the loop is considered parallel  multiple iterations can be executed together so long as order within an iteration is preserved
- If a loop has dependences across iterations, it is not parallel, and these dependences are referred to as "loop-carried"
- Not all loop-carried dependences imply a lack of parallelism
- Parallel loops are especially desirable in a multiprocessor system

Examples

for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;
No dependences

for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];     /* S1 */
    B[i+1] = B[i] + A[i+1];   /* S2 */
}
S2 depends on S1 in the same iteration
S1 depends on S1 from prev iteration
S2 depends on S2 from prev iteration

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];       /* S1 */
    B[i+1] = C[i] + D[i];     /* S2 */
}
S1 depends on S2 from prev iteration

for (i=1000; i>0; i=i-1)
    x[i] = x[i-3] + s;        /* S1 */
S1 depends on S1 from 3 prev iterations
Referred to as a recurrence
Dependence distance 3; limited parallelism

Finding Dependences – the GCD Test
- Do A[ai + b] and A[ci + d] refer to the same element?
- Restrict ourselves to affine array indices (expressible as ai + b, where i is the loop index and a, b are constants) – an example of a non-affine index: x[y[i]]
- For a dependence to exist, there must be two indices j and k, both within the loop bounds, such that aj + b = ck + d, i.e., aj – ck = d – b
- Let G = GCD(a,c); dividing through gives (a/G)j – (c/G)k = (d–b)/G, where the left-hand side is an integer – so if (d–b)/G is not an integer, the initial equality cannot be true
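A minimal C sketch of the test (function names and the driver below are illustrative, not from the lecture). Note it checks only the divisibility condition, so it is conservative: a "may depend" answer can still be ruled out by the loop bounds.

#include <stdio.h>

/* Euclid's algorithm for the greatest common divisor. */
static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* GCD test: can A[a*i + b] and A[c*i + d] refer to the same element
 * for some integer indices j, k?  Returns 1 if a dependence is
 * possible, 0 if the test proves independence.  Assumes a, c != 0. */
static int gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd(a < 0 ? -a : a, c < 0 ? -c : c);
    return (d - b) % g == 0;
}

int main(void) {
    /* x[2i+3] vs x[2i]: (0-3) not divisible by GCD(2,2)=2 -> independent */
    printf("%d\n", gcd_test_may_depend(2, 3, 2, 0));  /* prints 0 */
    /* x[4i+1] vs x[6i+3]: (3-1) divisible by GCD(4,6)=2 -> may depend */
    printf("%d\n", gcd_test_may_depend(4, 1, 6, 3));  /* prints 1 */
    return 0;
}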

Static vs. Dynamic ILP

Original loop:
Loop: L.D    F0, 0(R1)      ; F0 = array element
      ADD.D  F4, F0, F2     ; add scalar
      S.D    F4, 0(R1)      ; store result
      DADDUI R1, R1, #-8    ; decrement address pointer
      BNE    R1, R2, Loop   ; branch if R1 != R2

Statically unrolled loop:
Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      DADDUI R1, R1, #-32
      S.D    F12, 16(R1)
      BNE    R1, R2, Loop
      S.D    F16, 8(R1)

Large-window dynamic out-of-order processor (original code, iterations overlapped in hardware):
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop
      …
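A source-level view of the same 4x unrolling (a sketch; the function name is illustrative): one branch and one index update are paid per four elements instead of per element.

/* 4x-unrolled form of: for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s;
 * Assumes x holds elements x[1]..x[1000] and the trip count is a
 * multiple of the unroll factor (1000 is). */
void add_scalar_unrolled(double *x, double s) {
    int i;
    for (i = 1000; i > 0; i = i - 4) {
        x[i]     = x[i]     + s;   /* original iteration i */
        x[i - 1] = x[i - 1] + s;   /* iteration i-1 */
        x[i - 2] = x[i - 2] + s;   /* iteration i-2 */
        x[i - 3] = x[i - 3] + s;   /* iteration i-3 */
    }
}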

Dynamic ILP

Original loop:
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Renamed instruction stream (four iterations in flight):
L.D    F0, 0(R1)
ADD.D  F4, F0, F2
S.D    F4, 0(R1)
DADDUI R3, R1, #-8
BNE    R3, R2, Loop

L.D    F6, 0(R3)
ADD.D  F8, F6, F2
S.D    F8, 0(R3)
DADDUI R4, R3, #-8
BNE    R4, R2, Loop

L.D    F10, 0(R4)
ADD.D  F12, F10, F2
S.D    F12, 0(R4)
DADDUI R5, R4, #-8
BNE    R5, R2, Loop

L.D    F14, 0(R5)
ADD.D  F16, F14, F2
S.D    F16, 0(R5)
DADDUI R6, R5, #-8
BNE    R6, R2, Loop

Dynamic ILP – Issue Cycles

[Same renamed instruction stream as the previous slide, with each instruction annotated with its cycle of issue (cycles 1–9 shown). The annotations illustrate that a new iteration can begin at most once per cycle, since each iteration's address depends on the previous iteration's DADDUI.]

Loop Pipeline

[Diagram: successive iterations of L.D, ADD.D, S.D, DADDUI, BNE overlapped like stages of a pipeline, with a new iteration entering each cycle.]

Statically Unrolled Loop
Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      L.D    F18, -32(R1)
      ADD.D  F4, F0, F2
      L.D    F22, -40(R1)
      ADD.D  F8, F6, F2
      L.D    F26, -48(R1)
      ADD.D  F12, F10, F2
      L.D    F30, -56(R1)
      ADD.D  F16, F14, F2
      L.D    F34, -64(R1)
      ADD.D  F20, F18, F2
      S.D    F4, 0(R1)
      L.D    F38, -72(R1)
      ADD.D  F24, F22, F2
      S.D    F8, -8(R1)
      … …
      S.D    F12, 16(R1)
      S.D    F16, 8(R1)
      DADDUI R1, R1, #-32
      S.D    …
      BNE    R1, R2, Loop

Static vs. Dynamic

[Graph: new iterations completed per cycle over time, comparing a dynamic ILP processor against a static ILP (unrolled) loop.]

What if I doubled the number of resources in each processor?
What if I unrolled the loop and executed it on a dynamic ILP processor?

Static vs. Dynamic
- Dynamic: because of the loop index, at most one iteration can start every cycle – even fewer if there are resource constraints – in other words, we have a pipeline with a throughput of one iteration per cycle!
- Static: by eliminating the loop index, each iteration is independent  as many iterations can start in a cycle as there are resources – however, after a while, we don't start any more iterations – thus, loop unrolling provides a brief steady state, where an iteration starts/finishes every cycle, and the rest is start-up/wind-down for each unrolled loop

Software Pipeline?!

[Diagram: the five instructions L.D, ADD.D, S.D, DADDUI, BNE drawn as pipeline stages, suggesting that instructions from different iterations can be overlapped within a single loop body.]

Software Pipelining

Original loop:
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Software-pipelined loop:
Loop: S.D    F4, 16(R1)    ; store result of iteration i
      ADD.D  F4, F0, F2    ; add for iteration i+1
      L.D    F0, 0(R1)     ; load for iteration i+2
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

- Advantages: achieves nearly the same effect as loop unrolling, but without the code expansion – an unrolled loop may have inefficiencies at the start and end of each iteration, while a sw-pipelined loop is almost always in steady state – a sw-pipelined loop can also be unrolled to reduce loop overhead
- Disadvantages: does not reduce loop overhead, and may require more registers
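A source-level sketch of the same transformation (hypothetical names; assumes at least three iterations): the steady-state kernel overlaps the store of one iteration with the add of the next and the load of the one after, with a prologue to fill the pipeline and an epilogue to drain it.

/* Software-pipelined version of: for (i = n; i > 0; i = i - 1) x[i] = x[i] + s;
 * x holds elements x[1]..x[n]; assumes n >= 3. */
void add_scalar_swp(double *x, int n, double s) {
    double t_load, t_add;
    int i;

    /* Prologue: fill the pipeline. */
    t_load = x[n];              /* load for iteration n   */
    t_add  = t_load + s;        /* add for iteration n    */
    t_load = x[n - 1];          /* load for iteration n-1 */

    /* Kernel: one store, one add, one load per trip, each
     * belonging to a different original iteration. */
    for (i = n - 2; i > 0; i = i - 1) {
        x[i + 2] = t_add;       /* store result of iteration i+2 */
        t_add    = t_load + s;  /* add for iteration i+1 */
        t_load   = x[i];        /* load for iteration i */
    }

    /* Epilogue: drain the pipeline. */
    x[2] = t_add;
    x[1] = t_load + s;
}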

Midterm Exam
- Show up early! Time will be a constraint
- Attempt all questions – grading will be lenient
- Open books and notes
