1 Lecture: Static ILP Topics: loop unrolling, software pipelines (Sections C.5, 3.2)

Slides:

Advertisements

Similar presentations

Instruction-Level Parallelism

Advertisements

Compiler Support for Superscalar Processors. Loop Unrolling Assumption: Standard five stage pipeline Empty cycles between instructions before the result.

CPE 731 Advanced Computer Architecture Instruction Level Parallelism Part I Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.

HW 2 is out! Due 9/25!. CS 6290 Static Exploitation of ILP.

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Advanced Computer Architecture COE 501.

Compiler techniques for exposing ILP

1 Lecture 5: Static ILP Basics Topics: loop unrolling, VLIW (Sections 2.1 – 2.2)

ENGS 116 Lecture 101 ILP: Software Approaches Vincent H. Berk October 12 th Reading for today: , 4.1 Reading for Friday: 4.2 – 4.6 Homework #2:

CS 6461: Computer Architecture Basic Compiler Techniques for Exposing ILP Instructor: Morris Lancaster Corresponding to Hennessey and Patterson Fifth Edition.

Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.

CPE 631: ILP, Static Exploitation Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar Milenkovic,

1 4/20/06 Exploiting Instruction-Level Parallelism with Software Approaches Original by Prof. David A. Patterson.

Instruction Level Parallelism María Jesús Garzarán University of Illinois at Urbana-Champaign.

1 Lecture: Static ILP Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2)

Anshul Kumar, CSE IITD CS718 : VLIW - Software Driven ILP Introduction 23rd Mar, 2006.

Eliminating Stalls Using Compiler Support. Instruction Level Parallelism gcc 17% control transfer –5 instructions + 1 branch –Reordering among 5 instructions.

ILP: Loop UnrollingCSCE430/830 Instruction-level parallelism: Loop Unrolling CSCE430/830 Computer Architecture Lecturer: Prof. Hong Jiang Courtesy of Yifeng.

EECC551 - Shaaban #1 Fall 2003 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.

1 COMP 740: Computer Architecture and Implementation Montek Singh Tue, Feb 24, 2009 Topic: Instruction-Level Parallelism IV (Software Approaches/Compiler.

Computer Architecture Instruction Level Parallelism Dr. Esam Al-Qaralleh.

Dynamic Branch PredictionCS510 Computer ArchitecturesLecture Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.

CS152 Lec15.1 Advanced Topics in Pipelining Loop Unrolling Super scalar and VLIW Dynamic scheduling.

1 Advanced Computer Architecture Limits to ILP Lecture 3.

Lecture 3: Chapter 2 Instruction Level Parallelism Dr. Eng. Amr T. Abdel-Hamid CSEN 601 Spring 2011 Computer Architecture Text book slides: Computer Architec.

1 Lecture 10: Static ILP Basics Topics: loop unrolling, static branch prediction, VLIW (Sections 4.1 – 4.4)

CPE432 Chapter 4C.1Dr. W. Abu-Sufah, UJ Chapter 4C: The Processor, Part C Read Section 4.10 Parallelism and Advanced Instruction-Level Parallelism Adapted.

1 Lecture: Pipeline Wrap-Up and Static ILP Topics: multi-cycle instructions, precise exceptions, deep pipelines, compiler scheduling, loop unrolling, software.

1 Lecture: Static ILP Topics: predication, speculation (Sections C.5, 3.2)

EECC551 - Shaaban #1 Winter 2002 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.

EECC551 - Shaaban #1 Spring 2004 lec# Static Compiler Optimization Techniques We already examined the following static compiler techniques aimed.

EECC551 - Shaaban #1 Spring 2006 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.

EECC551 - Shaaban #1 Fall 2005 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.

EECC551 - Shaaban #1 Winter 2003 lec# Static Compiler Optimization Techniques We already examined the following static compiler techniques aimed.

1 Lecture 5: Pipeline Wrap-up, Static ILP Basics Topics: loop unrolling, VLIW (Sections 2.1 – 2.2) Assignment 1 due at the start of class on Thursday.

Chapter 2 Instruction-Level Parallelism and Its Exploitation

1 Lecture 6: Static ILP Topics: loop analysis, SW pipelining, predication, speculation (Section 2.2, Appendix G)

EECC551 - Shaaban #1 Winter 2011 lec# Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level.

EECC551 - Shaaban #1 Spring 2004 lec# Definition of basic instruction blocks Increasing Instruction-Level Parallelism & Size of Basic Blocks.

1 Lecture 6: Static ILP Topics: loop analysis, SW pipelining, predication, speculation (Section 2.2, Appendix G) Assignment 2 posted; due in a week.

EECC551 - Shaaban #1 Winter 2002 lec# Static Compiler Optimization Techniques We already examined the following static compiler techniques aimed.

1 Lecture 6: Static ILP Topics: loop analysis, SW pipelining, predication, speculation (Section 2.2, Appendix G) Please hand in Assignment 1 now Assignment.

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 1 Compiler Techniques for ILP  So far we have explored dynamic hardware techniques for.

Instruction Rescheduling and Loop-Unroll Department of Computer Science Southern Illinois University Edwardsville Fall, 2015 Dr. Hiroshi Fujinoki

1 Lecture: Pipeline Wrap-Up and Static ILP Topics: multi-cycle instructions, precise exceptions, deep pipelines, compiler scheduling, loop unrolling, software.

Compiler Techniques for ILP

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue

CS5100 Advanced Computer Architecture Instruction-Level Parallelism

CSCE430/830 Computer Architecture

Lecture 11: Advanced Static ILP

CSL718 : VLIW - Software Driven ILP

Lecture 6: Static ILP, Branch prediction

Lecture: Static ILP Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2)

Computer Architecture

Lecture: Static ILP Topics: predication, speculation (Sections C.5, 3.2)

Lecture: Static ILP Topics: loop unrolling, software pipelines (Sections C.5, 3.2)

Lecture: Static ILP Topics: predication, speculation (Sections C.5, 3.2)

Lecture: Static ILP Topics: loop unrolling, software pipelines (Sections C.5, 3.2) HW3 posted, due in a week.

CS 704 Advanced Computer Architecture

Adapted from the slides of Prof

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

CSC3050 – Computer Architecture

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Lecture 5: Pipeline Wrap-up, Static ILP

Presentation transcript:

1 Lecture: Static ILP Topics: loop unrolling, software pipelines (Sections C.5, 3.2)

2 Loop Example for (i=1000; i>0; i--) x[i] = x[i] + s; Loop: L.D F0, 0(R1) ; F0 = array element ADD.D F4, F0, F2 ; add scalar S.D F4, 0(R1) ; store result DADDUI R1, R1,# -8 ; decrement address pointer BNE R1, R2, Loop ; branch if R1 != R2 NOP Source code Assembly code Loop: L.D F0, 0(R1) ; F0 = array element stall ADD.D F4, F0, F2 ; add scalar stall S.D F4, 0(R1) ; store result DADDUI R1, R1,# -8 ; decrement address pointer stall BNE R1, R2, Loop ; branch if R1 != R2 stall 10-cycle schedule LD -> any : 1 stall FPALU -> any: 3 stalls FPALU -> ST : 2 stalls IntALU -> BR : 1 stall

3 Smart Schedule By re-ordering instructions, it takes 6 cycles per iteration instead of 10 We were able to violate an anti-dependence easily because an immediate was involved Loop overhead (instrs that do book-keeping for the loop): 2 Actual work (the ld, add.d, and s.d): 3 instrs Can we somehow get execution time to be 3 cycles per iteration? Loop: L.D F0, 0(R1) stall ADD.D F4, F0, F2 stall S.D F4, 0(R1) DADDUI R1, R1,# -8 stall BNE R1, R2, Loop stall Loop: L.D F0, 0(R1) DADDUI R1, R1,# -8 ADD.D F4, F0, F2 stall BNE R1, R2, Loop S.D F4, 8(R1) LD -> any : 1 stall FPALU -> any: 3 stalls FPALU -> ST : 2 stalls IntALU -> BR : 1 stall

4 Problem 1 for (i=1000; i>0; i--) x[i] = y[i] * s; Loop: L.D F0, 0(R1) ; F0 = array element MUL.D F4, F0, F2 ; multiply scalar S.D F4, 0(R2) ; store result DADDUI R1, R1,# -8 ; decrement address pointer DADDUI R2, R2,#-8 ; decrement address pointer BNE R1, R3, Loop ; branch if R1 != R3 NOP Source code Assembly code LD -> any : 1 stall FPMUL -> any: 5 stalls FPMUL -> ST : 4 stalls IntALU -> BR : 1 stall How many cycles do the default and optimized schedules take?

5 Problem 1 for (i=1000; i>0; i--) x[i] = y[i] * s; Loop: L.D F0, 0(R1) ; F0 = array element MUL.D F4, F0, F2 ; multiply scalar S.D F4, 0(R2) ; store result DADDUI R1, R1,# -8 ; decrement address pointer DADDUI R2, R2,#-8 ; decrement address pointer BNE R1, R3, Loop ; branch if R1 != R3 NOP Source code Assembly code LD -> any : 1 stall FPMUL -> any: 5 stalls FPMUL -> ST : 4 stalls IntALU -> BR : 1 stall How many cycles do the default and optimized schedules take? Unoptimized: LD 1s MUL 4s SD DA DA BNE 1s cycles Optimized: LD DA MUL DA 2s BNE SD -- 8 cycles

6 Loop Unrolling Loop: L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) L.D F6, -8(R1) ADD.D F8, F6, F2 S.D F8, -8(R1) L.D F10,-16(R1) ADD.D F12, F10, F2 S.D F12, -16(R1) L.D F14, -24(R1) ADD.D F16, F14, F2 S.D F16, -24(R1) DADDUI R1, R1, #-32 BNE R1,R2, Loop Loop overhead: 2 instrs; Work: 12 instrs How long will the above schedule take to complete?

7 Scheduled and Unrolled Loop Loop: L.D F0, 0(R1) L.D F6, -8(R1) L.D F10,-16(R1) L.D F14, -24(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 S.D F4, 0(R1) S.D F8, -8(R1) DADDUI R1, R1, # -32 S.D F12, 16(R1) BNE R1,R2, Loop S.D F16, 8(R1) Execution time: 14 cycles or 3.5 cycles per original iteration LD -> any : 1 stall FPALU -> any: 3 stalls FPALU -> ST : 2 stalls IntALU -> BR : 1 stall

8 Loop Unrolling Increases program size Requires more registers To unroll an n-iteration loop by degree k, we will need (n/k) iterations of the larger loop, followed by (n mod k) iterations of the original loop

9 Automating Loop Unrolling Determine the dependences across iterations: in the example, we knew that loads and stores in different iterations did not conflict and could be re-ordered Determine if unrolling will help – possible only if iterations are independent Determine address offsets for different loads/stores Dependency analysis to schedule code without introducing hazards; eliminate name dependences by using additional registers

10 Problem 2 for (i=1000; i>0; i--) x[i] = y[i] * s; Loop: L.D F0, 0(R1) ; F0 = array element MUL.D F4, F0, F2 ; multiply scalar S.D F4, 0(R2) ; store result DADDUI R1, R1,# -8 ; decrement address pointer DADDUI R2, R2,#-8 ; decrement address pointer BNE R1, R3, Loop ; branch if R1 != R3 NOP Source code Assembly code LD -> any : 1 stall FPMUL -> any: 5 stalls FPMUL -> ST : 4 stalls IntALU -> BR : 1 stall How many unrolls does it take to avoid stall cycles?

11 Problem 2 for (i=1000; i>0; i--) x[i] = y[i] * s; Loop: L.D F0, 0(R1) ; F0 = array element MUL.D F4, F0, F2 ; multiply scalar S.D F4, 0(R2) ; store result DADDUI R1, R1,# -8 ; decrement address pointer DADDUI R2, R2,#-8 ; decrement address pointer BNE R1, R3, Loop ; branch if R1 != R3 NOP Source code Assembly code LD -> any : 1 stall FPMUL -> any: 5 stalls FPMUL -> ST : 4 stalls IntALU -> BR : 1 stall How many unrolls does it take to avoid stall cycles? Degree 2: LD LD MUL MUL DA DA 1s SD BNE SD Degree 3: LD LD LD MUL MUL MUL DA DA SD SD BNE SD – 12 cyc/3 iterations

12 Superscalar Pipelines Integer pipeline FP pipeline Handles L.D, S.D, ADDUI, BNE Handles ADD.D What is the schedule with an unroll degree of 4?

13 Superscalar Pipelines Integer pipeline FP pipeline Loop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) ADD.D F4,F0,F2 L.D F14,-24(R1) ADD.D F8,F6,F2 L.D F18,-32(R1) ADD.D F12,F10,F2 S.D F4,0(R1) ADD.D F16,F14,F2 S.D F8,-8(R1) ADD.D F20,F18,F2 S.D F12,-16(R1) DADDUI R1,R1,# -40 S.D F16,16(R1) BNE R1,R2,Loop S.D F20,8(R1) Need unroll by degree 5 to eliminate stalls The compiler may specify instructions that can be issued as one packet The compiler may specify a fixed number of instructions in each packet: Very Large Instruction Word (VLIW)

14 Problem 3 for (i=1000; i>0; i--) x[i] = y[i] * s; Loop: L.D F0, 0(R1) ; F0 = array element MUL.D F4, F0, F2 ; multiply scalar S.D F4, 0(R2) ; store result DADDUI R1, R1,# -8 ; decrement address pointer DADDUI R2, R2,#-8 ; decrement address pointer BNE R1, R3, Loop ; branch if R1 != R3 NOP Source code Assembly code LD -> any : 1 stall FPMUL -> any: 5 stalls FPMUL -> ST : 4 stalls IntALU -> BR : 1 stall How many unrolls does it take to avoid stalls in the superscalar pipeline?

15 Problem 3 for (i=1000; i>0; i--) x[i] = y[i] * s; Loop: L.D F0, 0(R1) ; F0 = array element MUL.D F4, F0, F2 ; multiply scalar S.D F4, 0(R2) ; store result DADDUI R1, R1,# -8 ; decrement address pointer DADDUI R2, R2,#-8 ; decrement address pointer BNE R1, R3, Loop ; branch if R1 != R3 NOP Source code Assembly code LD -> any : 1 stall FPMUL -> any: 5 stalls FPMUL -> ST : 4 stalls IntALU -> BR : 1 stall How many unrolls does it take to avoid stalls in the superscalar pipeline? LD LD MUL LD MUL 7 unrolls. Could also make do with 5 if we LD MUL moved up the DADDUIs. LD MUL SD MUL

16 Software Pipeline?! L.DADD.DS.D DADDUIBNE L.DADD.DS.D L.DADD.DS.D L.DADD.DS.D L.DADD.D L.DADD.D DADDUIBNE DADDUIBNE DADDUIBNE DADDUIBNE DADDUIBNE … … Loop: L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1,# -8 BNE R1, R2, Loop

17 Software Pipeline L.DADD.DS.D L.DADD.DS.D L.DADD.DS.D L.DADD.DS.D L.DADD.DS.D L.DADD.DS.D L.DADD.D L.D Original iter 1 Original iter 2 Original iter 3 Original iter 4 New iter 1 New iter 2 New iter 3 New iter 4

18 Software Pipelining Loop: L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1,# -8 BNE R1, R2, Loop Loop: S.D F4, 16(R1) ADD.D F4, F0, F2 L.D F0, 0(R1) DADDUI R1, R1,# -8 BNE R1, R2, Loop Advantages: achieves nearly the same effect as loop unrolling, but without the code expansion – an unrolled loop may have inefficiencies at the start and end of each iteration, while a sw-pipelined loop is almost always in steady state – a sw-pipelined loop can also be unrolled to reduce loop overhead Disadvantages: does not reduce loop overhead, may require more registers

19 Problem 4 for (i=1000; i>0; i--) x[i] = y[i] * s; Loop: L.D F0, 0(R1) ; F0 = array element MUL.D F4, F0, F2 ; multiply scalar S.D F4, 0(R2) ; store result DADDUI R1, R1,# -8 ; decrement address pointer DADDUI R2, R2,#-8 ; decrement address pointer BNE R1, R3, Loop ; branch if R1 != R3 NOP Source code Assembly code LD -> any : 1 stall FPMUL -> any: 5 stalls FPMUL -> ST : 4 stalls IntALU -> BR : 1 stall Show the SW pipelined version of the code and does it cause stalls?

20 Problem 4 for (i=1000; i>0; i--) x[i] = y[i] * s; Loop: L.D F0, 0(R1) ; F0 = array element MUL.D F4, F0, F2 ; multiply scalar S.D F4, 0(R2) ; store result DADDUI R1, R1,# -8 ; decrement address pointer DADDUI R2, R2,#-8 ; decrement address pointer BNE R1, R3, Loop ; branch if R1 != R3 NOP Source code Assembly code LD -> any : 1 stall FPMUL -> any: 5 stalls FPMUL -> ST : 4 stalls IntALU -> BR : 1 stall Show the SW pipelined version of the code and does it cause stalls? Loop: S.D F4, 0(R2) MUL F4, F0, F2 L.D F0, 0(R1) DADDUI R2, R2, #-8 BNE R1, R3, Loop DADDUI R1, R1, #-8 There will be no stalls

21 Title Bullet