CSL718 : VLIW - Software Driven ILP

Presentation transcript:

CSL718 : VLIW - Software Driven ILP
Compiler Support for Exposing and Exploiting ILP
1st Apr, 2006
Anshul Kumar, CSE IITD

Code Scheduling for VLIW

Objective: move code around and form packets of concurrently executable instructions. Two possibilities:
- Local: work on a straight-line piece of code (a basic block), i.e., do not move code across conditional branches.
- Global: code may move across conditional branches.
Loops need to be tackled in both cases.

Pipeline scheduling example

for (i=1000; i>0; i--)
    x[i] = x[i] + s;

Loop:   L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        DADDUI  R1, R1, #-8
        BNE     R1, R2, Loop

Latency due to data hazards

Producer instruction    Consumer instruction    Latency
FP ALU op               FP ALU op               3
FP ALU op               Store double            2
Load double             FP ALU op               1

Assume no structural hazards. With these latencies, the L.D -> ADD.D pair in the loop above incurs 1 stall and the ADD.D -> S.D pair incurs 2 stalls.

Straightforward scheduling

Loop:   L.D     F0, 0(R1)       1
        stall                   2    (load -> FP ALU: 1 cycle)
        ADD.D   F4, F0, F2      3
        stall                   4    (FP ALU -> store: 2 cycles)
        stall                   5
        S.D     F4, 0(R1)       6
        DADDUI  R1, R1, #-8     7
        stall                   8
        BNE     R1, R2, Loop    9
        stall                   10   (branch delay)

10 cycles per iteration.

A better schedule

Loop:   L.D     F0, 0(R1)       1
        DADDUI  R1, R1, #-8     2
        ADD.D   F4, F0, F2      3
        stall                   4
        BNE     R1, R2, Loop    5
        S.D     F4, 8(R1)       6    (offset adjusted for the earlier DADDUI)

6 cycles per iteration.

Loop unrolling (4 times)

Loop:   L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        L.D     F6, -8(R1)
        ADD.D   F8, F6, F2
        S.D     F8, -8(R1)
        L.D     F10, -16(R1)
        ADD.D   F12, F10, F2
        S.D     F12, -16(R1)
        L.D     F14, -24(R1)
        ADD.D   F16, F14, F2
        S.D     F16, -24(R1)
        DADDUI  R1, R1, #-32
        BNE     R1, R2, Loop

Without re-ordering, the stalls remain (1 after each L.D, 2 after each ADD.D, plus the DADDUI and branch stalls): 28 cycles for 4 iterations, i.e., 28/4 = 7 cycles per iteration.
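
At the source level, the unrolled loop corresponds to the following C sketch (a hedged illustration, not from the slides; the function name and the assumption that the trip count of 1000 is a multiple of 4 are ours). The four statements in the body are mutually independent, mirroring the renamed F4/F8/F12/F16 registers:

#include <stddef.h>

/* Unroll-by-4 sketch of: for (i=1000; i>0; i--) x[i] = x[i] + s;
 * Assumes the trip count (1000) is a multiple of 4, so no clean-up
 * loop is needed. x must be valid for indices 1..1000. */
void add_scalar_unrolled(double *x, double s)
{
    for (size_t i = 1000; i > 0; i -= 4) {
        x[i]     += s;   /* uses F0/F4 in the assembly version */
        x[i - 1] += s;   /* uses F6/F8   */
        x[i - 2] += s;   /* uses F10/F12 */
        x[i - 3] += s;   /* uses F14/F16 */
    }
}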

Re-scheduling the unrolled loop

Loop:   L.D     F0, 0(R1)       1
        L.D     F6, -8(R1)      2
        L.D     F10, -16(R1)    3
        L.D     F14, -24(R1)    4
        ADD.D   F4, F0, F2      5
        ADD.D   F8, F6, F2      6
        ADD.D   F12, F10, F2    7
        ADD.D   F16, F14, F2    8
        S.D     F4, 0(R1)       9
        S.D     F8, -8(R1)      10
        DADDUI  R1, R1, #-32    11
        S.D     F12, 16(R1)     12   (offset adjusted for the DADDUI)
        BNE     R1, R2, Loop    13
        S.D     F16, 8(R1)      14   (in the branch delay slot)

No stalls: 14 cycles for 4 iterations, i.e., 14/4 = 3.5 cycles per iteration.

Limits to unrolling
- Diminishing returns: each additional unroll amortizes less of the loop overhead.
- Growth in code size.
- Register renaming increases register pressure.

Scheduling example with a 2-issue processor

Cycle   Memory reference        FP operation
1       L.D   F0, 0(R1)
2       L.D   F6, -8(R1)
3       L.D   F10, -16(R1)      ADD.D F4, F0, F2
4       L.D   F14, -24(R1)      ADD.D F8, F6, F2
5       L.D   F18, -32(R1)      ADD.D F12, F10, F2
6       S.D   F4, 0(R1)         ADD.D F16, F14, F2
7       S.D   F8, -8(R1)        ADD.D F20, F18, F2
8       S.D   F12, -16(R1)
9       DADDUI R1, R1, #-40
10      S.D   F16, 16(R1)
11      BNE   R1, R2, Loop
12      S.D   F20, 8(R1)

12 cycles for 5 iterations: 12/5 = 2.4 cycles per iteration.

Scheduling example with a 5-issue VLIW processor

Cycle  Memory ref 1        Memory ref 2        FP op 1             FP op 2             Int/branch
1      L.D F0, 0(R1)       L.D F6, -8(R1)
2      L.D F10, -16(R1)    L.D F14, -24(R1)
3      L.D F18, -32(R1)    L.D F22, -40(R1)    ADD.D F4, F0, F2    ADD.D F8, F6, F2
4      L.D F26, -48(R1)                        ADD.D F12, F10, F2  ADD.D F16, F14, F2
5                                              ADD.D F20, F18, F2  ADD.D F24, F22, F2
6      S.D F4, 0(R1)       S.D F8, -8(R1)      ADD.D F28, F26, F2
7      S.D F12, -16(R1)    S.D F16, -24(R1)                                            DADDUI R1, R1, #-56
8      S.D F20, 24(R1)     S.D F24, 16(R1)
9      S.D F28, 8(R1)                                                                  BNE R1, R2, Loop

9 cycles for 7 iterations: 9/7 = 1.3 cycles per iteration.

Scheduling results

Schedule                              Cycles per iteration
Straightforward scheduling            10
With instruction re-ordering          6
With loop unrolling (4 times)         7
Unrolling + re-ordering               3.5
Scheduling on 2-issue VLIW            2.4   (12 cycles / 5 iterations)
Scheduling on 5-issue VLIW            1.3   (9 cycles / 7 iterations)

Loop Level Parallelism

Dependences in the context of a loop:
- dependence within an iteration
- dependence across iterations, i.e., loop-carried dependence

Example with no loop-carried dependence:

for (i=1000; i>0; i--)
    x[i] = x[i] + s;

There is a dependence within each iteration (the load of x[i] feeds the add, which feeds the store), but no dependence between iterations, so distinct iterations can overlap freely.

Example with loop-carried dependence:

for (i=1; i<=100; i++){
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

Assume the arrays are distinct and non-overlapping.
- S1 uses a value of A computed in the previous iteration; this restricts overlapping of different iterations.
- S2 uses the value of A[i+1] computed by S1 in the same iteration; this restricts movement of instructions within an iteration.

Another example:

for (i=1; i<=100; i++){
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

S1 uses a value of B computed by S2 in the previous iteration. Still, the iterations can be parallelized: there is no cycle among the dependences, so a transformation can remove the loop-carried dependence.

Transformed loop of the previous example:

A[1] = A[1] + B[1];
for (i=1; i<=99; i++){
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];

Now there is no loop-carried dependence: the iterations can be parallelized while preserving the dependence within each iteration. (The final assignment to B[101] is needed because the original loop's last iteration also writes it.)
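
A minimal C harness (our own illustration; the test data and names are arbitrary) that runs the original and transformed loops on the same inputs and checks they produce identical results:

#include <assert.h>
#include <stdio.h>

int main(void)
{
    /* Indices 1..100 are used for A, 1..101 for B, as in the slide. */
    double A1[102], A2[102], B1[102], B2[102], C[102], D[102];
    int i;

    for (i = 0; i < 102; i++) {          /* arbitrary test data */
        A1[i] = A2[i] = i * 0.5;
        B1[i] = B2[i] = i * 0.25;
        C[i] = i * 2.0;
        D[i] = i * 3.0;
    }

    /* Original loop. */
    for (i = 1; i <= 100; i++) {
        A1[i] = A1[i] + B1[i];           /* S1 */
        B1[i + 1] = C[i] + D[i];         /* S2 */
    }

    /* Transformed loop: no loop-carried dependence. */
    A2[1] = A2[1] + B2[1];
    for (i = 1; i <= 99; i++) {
        B2[i + 1] = C[i] + D[i];
        A2[i + 1] = A2[i + 1] + B2[i + 1];
    }
    B2[101] = C[100] + D[100];

    for (i = 1; i <= 100; i++)
        assert(A1[i] == A2[i]);
    for (i = 1; i <= 101; i++)
        assert(B1[i] == B2[i]);
    puts("original and transformed loops agree");
    return 0;
}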

Dependence distance

for (i=6; i<=100; i++){
    B[i] = B[i-5] + B[i];
}

The dependence distance is 5: iteration i depends only on iteration i-5, so any 5 consecutive iterations are mutually independent. This gives an opportunity for parallelization in spite of the loop-carried dependence, as the sketch below shows.
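
A C sketch of exploiting the distance (illustrative code, not from the slides): the loop is unrolled by 5, and the five statements in the body can be scheduled or vectorized in parallel.

/* B[i] = B[i-5] + B[i] has dependence distance 5, so iterations
 * i..i+4 are mutually independent. The range 6..100 contains 95
 * iterations, an exact multiple of 5, so no clean-up is needed. */
void distance5(double *B)
{
    for (int i = 6; i + 4 <= 100; i += 5) {
        /* No dependences among these five statements: each writes
         * B[i+k] and reads only B[i+k-5], which lies entirely
         * before the block of elements being written. */
        B[i]     = B[i - 5] + B[i];
        B[i + 1] = B[i - 4] + B[i + 1];
        B[i + 2] = B[i - 3] + B[i + 2];
        B[i + 3] = B[i - 2] + B[i + 3];
        B[i + 4] = B[i - 1] + B[i + 4];
    }
}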

Finding dependences

Important for:
- code scheduling
- loop parallelization
- removal of false dependences

Analysis is complicated by:
- pointers and passing of parameters by reference, and the consequent potential for aliasing
- use of complex expressions as indices of arrays

Analysis with affine indices

for (i=m; i<=n; i++){
    A[a*i+b] = A[c*i+d] + B[i];
}

Can a*i+b become equal to c*i+d for some value of i within the range m to n? This is difficult to determine in general: a, b, c, d may not be known at compile time and could depend on other loop indices. When a, b, c, d are constants, we can use the GCD test: if a dependence exists, GCD(c,a) must divide d-b.

Example with affine indices:

for (i=1; i<=100; i++){
    A[4*i+1] = A[6*i+4] + B[i];
}

GCD(4,6) is 2 and d-b = 4-1 = 3. Since 2 does not divide 3, the two indices can never take the same value:
values of 4*i+1: 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, ...
values of 6*i+4: 10, 16, 22, 28, 34, 40, 46, 52, 58, ...
The test is conservative: sometimes GCD(c,a) may divide d-b even though no dependence exists.
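
A minimal C implementation of the GCD test (a hypothetical helper of our own; the slides give no code), checked against the example above with a=4, b=1, c=6, d=4:

#include <stdio.h>
#include <stdlib.h>

/* Euclid's algorithm; assumes x and y are not both zero. */
static int gcd(int x, int y)
{
    while (y != 0) { int t = x % y; x = y; y = t; }
    return x;
}

/* GCD test for A[a*i+b] = A[c*i+d] + ...
 * Returns 1 if a dependence is possible, 0 if definitely impossible.
 * Conservative: may answer 1 even when the loop bounds rule the
 * dependence out. Assumes a and c are non-zero constants. */
static int gcd_test(int a, int b, int c, int d)
{
    return (d - b) % gcd(abs(a), abs(c)) == 0;
}

int main(void)
{
    /* A[4*i+1] = A[6*i+4] + B[i]: gcd(4,6) = 2 does not divide 3. */
    printf("dependence possible: %d\n", gcd_test(4, 1, 6, 4)); /* prints 0 */
    return 0;
}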

Reducing the impact of dependent computations

Copy propagation (used in loop unrolling):
before:  DADDUI R1, R2, #4
         DADDUI R1, R1, #4
after:   DADDUI R1, R2, #8

Tree height reduction:
before:  ADD R1, R2, R3
         ADD R4, R1, R6
         ADD R8, R4, R7
after:   ADD R1, R2, R3
         ADD R4, R6, R7
         ADD R8, R1, R4
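
The same two ideas at C level (a sketch with invented names). For integers the rebalanced form is exactly equivalent; for floating point it changes rounding, which is why compilers need explicit permission to apply it:

/* Copy propagation: folding the chain of dependent increments
 * (DADDUI R1,R2,#4 ; DADDUI R1,R1,#4) into one independent op. */
long bump_chain(long r2)  { long r1 = r2 + 4; return r1 + 4; }
long bump_folded(long r2) { return r2 + 8; }   /* DADDUI R1,R2,#8 */

/* Tree height reduction: ((a+b)+c)+d is a chain of three dependent
 * adds (height 3); (a+b)+(c+d) has height 2, so the two inner adds
 * can issue in the same cycle. */
long sum4_chain(long a, long b, long c, long d)    { return ((a + b) + c) + d; }
long sum4_balanced(long a, long b, long c, long d) { return (a + b) + (c + d); }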

Software pipelining: symbolic loop unrolling
(Figure: iterations 0-6 shown overlapped in time; in steady state each pass through the loop body executes instructions drawn from several consecutive iterations.)

Software pipelining example

Original loop:
Loop:   L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        DADDUI  R1, R1, #-8
        BNE     R1, R2, Loop

Software-pipelined loop (start-up and wind-down code not shown):
Loop:   S.D     F4, 16(R1)    ; store for iteration i
        ADD.D   F4, F0, F2    ; add for iteration i+1
        L.D     F0, 0(R1)     ; load for iteration i+2
        DADDUI  R1, R1, #-8
        BNE     R1, R2, Loop

Each pass through the kernel takes its S.D, ADD.D and L.D from three consecutive iterations (i, i+1, i+2 in the symbolic unrolling), so no instruction in the body depends on the result of an instruction issued earlier in the same body.
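
A C-level sketch of the transformation (our own illustration; the start-up and wind-down code omitted on the slide is made explicit). Each kernel pass stores for the oldest in-flight iteration, adds for the next, and loads for the newest, exactly like the S.D / ADD.D / L.D kernel above:

/* Software-pipelined form of: for (i=n; i>0; i--) x[i] = x[i] + s;
 * Assumes n >= 2. */
void add_scalar_swp(double *x, double s, int n)
{
    /* Start-up (prologue): begin the first two iterations. */
    double loaded = x[n];         /* L.D   for the x[n] iteration   */
    double summed = loaded + s;   /* ADD.D for the x[n] iteration   */
    loaded = x[n - 1];            /* L.D   for the x[n-1] iteration */

    /* Kernel: three iterations in flight per pass. */
    for (int i = n; i > 2; i--) {
        x[i] = summed;            /* S.D   finishes the x[i] iteration   */
        summed = loaded + s;      /* ADD.D for the x[i-1] iteration      */
        loaded = x[i - 2];        /* L.D   starts the x[i-2] iteration   */
    }

    /* Wind-down (epilogue): drain the two in-flight iterations. */
    x[2] = summed;
    x[1] = loaded + s;
}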

Difficulties in software pipelining
- Overheads: increased register requirement; explicit register management is needed.
- The loop body may be complex and may require several transformations before pipelining.

Global code scheduling
- Consider regions of code larger than basic blocks, including multiple basic blocks and conditionals.
- Question: how can code be moved across branch and join points?

Static scheduling and branch prediction

Static branch prediction is helpful with:
- delayed branches
- static scheduling

Methods:
- fixed prediction
- opcode-based prediction
- address-based prediction
- profile-driven prediction (misprediction ~10%, instructions between mispredictions ~100)

Branch prediction and scheduling

        LD     R1, 0(R2)
        DSUBU  R1, R1, R3
        BEQZ   R1, L
        OR     R4, R5, R6     ; candidate A
        DADDU  R10, R4, R3
L:      DADDU  R7, R8, R9     ; candidate B

To fill the load delay slot after the LD:
- A: move the OR up when the branch is predicted as not taken and R4 is not needed on the taken path.
- B: move the DADDU at L up when the branch is predicted as taken and R7 is not needed on the fall-through (not-taken) path.

Global code scheduling

A[i] = A[i] + B[i];
if (A[i] == 0)
    B[i] = ...;   /* predicted path */
else
    X;
C[i] = ...;

- When can the assignment to B[i] be moved before the comparison?
- When can the assignment to C[i] be moved above the join point? Above the comparison?

(Figure: flow graph with the test A[i]=0?, the predicted then-path assigning B[i], the else-path X, and both paths joining before the assignment to C[i]. A hedged sketch of the legal motions follows.)
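
A C sketch of the two motions (X, v, w and the legality conditions stated in comments are our additions; the slide only poses the questions):

extern void X(void);   /* whatever the non-predicted path does */

void scheduled(double *A, double *B, double *C, int i, double v, double w)
{
    /* Original shape:
     *     A[i] = A[i] + B[i];
     *     if (A[i] == 0) B[i] = v; else X();
     *     C[i] = w;
     * The hoisted form below is legal only if:
     *  - the old value of B[i] is dead on the else path (X never
     *    reads it) and the assignment cannot fault, and
     *  - neither arm reads or writes C[i], and w does not depend
     *    on the branch outcome. */
    A[i] = A[i] + B[i];   /* old B[i] is consumed here */
    B[i] = v;             /* moved above the comparison (speculative) */
    C[i] = w;             /* moved above the join and the comparison  */
    if (A[i] == 0) {
        /* predicted path: nothing left to do */
    } else {
        X();
    }
}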

Trace scheduling
(Figure: the flow graph from the previous slide viewed as a trace. The predicted path A[i]=A[i]+B[i] -> B[i]=... -> C[i]=... forms the trace; the branch to the off-trace path is a trace exit, and the join point before C[i]=... is a trace entry.)

Regions for global scheduling
- Trace: linear path through the code (followed with high probability); multiple entries and exits.
- Superblock: linear path with a single entry and multiple exits.
- Hyperblock: superblock plus internal control flow.
- Treegion: tree with a single entry and multiple exits.
- Trace-2: loop-free region.

Trace
(Figure: control-flow graph with basic blocks B1-B6 and edge execution frequencies; the trace is formed along the most frequently executed path, here B1 -> B2 -> B4 -> B5 -> B6, and may have multiple entries and exits.)

Superblock
(Figure: the same graph; the superblock follows the same frequent path, but side entrances into the middle of the path are disallowed, so the region has a single entry at B1 and multiple exits.)

Superblock with tail duplication
(Figure: the same graph with tail blocks duplicated as B4' and B6' so that the frequent path through B1, B2, B4, B5, B6 has no side entrances; edge frequencies are split between the originals and the duplicates.)

Hyperblock
(Figure: the same graph formed into a hyperblock: a single-entry region that includes both arms of the B2/B3 conditional as internal control flow, with a duplicated B6' removing the remaining side entrance.)

Treegion
(Figure: the same graph expanded into a tree by duplicating blocks (B4', B5', and B6'-B6'''), giving a single-entry, multiple-exit tree region.)