1 Advanced Computer Architecture Limits to ILP Lecture 3.


2 Factors limiting ILP
- Rename registers: reservation stations, reorder buffer, rename file
- Branch prediction
- Jump prediction (harder than branches)
- Memory aliasing
- Window size
- Issue width

3 ILP in a “perfect” processor

4 Window size limits
- How far ahead can you look?
- 50-instruction window: 2,450 register comparisons looking for RAW, WAR, WAW (assuming only register-register operations)
- 2,000-instruction window: ~4M comparisons
- Commercial machines: window size up to 128
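The comparison counts quoted above follow from a simple quadratic model; a quick sketch (the n*(n-1) formula is an assumption that matches the slide's numbers):

```c
/* Dependence-check cost for an n-instruction window, using the
   quadratic model implied by the slide: every instruction is compared
   against every other one, giving n*(n-1) ordered pairs. */
long window_comparisons(long n) {
    return n * (n - 1);
}
/* window_comparisons(50)   -> 2450     (the 50-entry window)
   window_comparisons(2000) -> 3998000  (~4M comparisons)        */
```

The quadratic growth is why real machines cap the window far below the "perfect processor" sizes studied here.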

5 Window size effects
- Assuming all instructions take 1 cycle
- Assuming perfect branch prediction

6 Window size effects, limited issue
- Limited to 64 issues per clock (note: current limits are closer to 6)

7 Branch prediction effects Assuming 2K instruction window

8 How much does misprediction affect ILP? Not much

9 Register limitations Current processors near 256 – not much left!

10 Memory disambiguation
- Memory aliasing: like RAW, WAW, WAR hazards, but through loads and stores
- Even the compiler can't always know: indirection, base address changes, pointers
- How to avoid?
  - Don't: do everything in order
  - Speculate: fix up later
  - Value prediction
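A small C illustration (a hypothetical helper, not from the slides) of why aliasing blocks reordering: when the two pointers overlap, each store feeds the next load, so neither the compiler nor the hardware may reorder the memory operations without a disambiguation check:

```c
/* If dst and src may alias, the store in one iteration can feed the
   load in the next (a RAW hazard through memory), so the iterations
   cannot be reordered or vectorized without checking the addresses. */
void copy_plus_one(int *dst, const int *src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i] + 1;
}
/* With overlapping arguments the dependence is real:
     int a[5] = {1, 1, 1, 1, 1};
     copy_plus_one(a + 1, a, 4);   // a becomes {1, 2, 3, 4, 5}
   With disjoint arrays, every iteration is independent. */
```

The same source code is fully parallel for disjoint arrays; the compiler's problem is proving disjointness at compile time.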

11 Alias analysis Current compilers fall somewhere between "inspection" and "global/stack perfect"

12 Instruction issue complexity
- "Classic" RISC: only 32-bit instructions
- Current RISC: most have 16-bit as well as 32-bit forms
- x86: 1 to 17 bytes per instruction

13 Tradeoff
- More instruction issue
  - More gates in decode – lower clock rate
  - More pipe stages in decode – more branch penalty
- More instruction execution
  - More register ports – lower clock rate
  - More comparisons – lower clock rate or more pipe stages
- Chip area grows faster than performance?

14 Help from the compiler
Standard FP loop:
  Loop: L.D    F0,0(R1)
        ADD.D  F4,F0,F2
        S.D    F4,0(R1)
        DADDUI R1,R1,#-8
        BNE    R1,R2,Loop
Stalls: 1 cycle load stall, 2 cycle execute stall, 1 cycle branch stall

15 Same code, reordered
  Loop: L.D    F0,0(R1)
        DADDUI R1,R1,#-8
        ADD.D  F4,F0,F2
        BNE    R1,R2,Loop
        S.D    F4,8(R1)     ; offset adjusted to 8: R1 was already decremented
DADDUI fills the load delay slot; S.D fills the branch delay slot.
1 cycle execute stall remains. Saves 3 cycles per loop.

16 Loop Unrolling
- In the previous loop, 3 instructions do "work" and 2 instructions are "overhead" (loop control)
- Can we minimize overhead? Do more "work" per loop
- Repeat the loop body multiple times for each counter/branch operation

17 Unrolled Loop
  Loop: L.D    F0,0(R1)
        ADD.D  F4,F0,F2
        S.D    F4,0(R1)
        L.D    F0,-8(R1)
        ADD.D  F4,F0,F2
        S.D    F4,-8(R1)
        L.D    F0,-16(R1)
        ADD.D  F4,F0,F2
        S.D    F4,-16(R1)
        L.D    F0,-24(R1)
        ADD.D  F4,F0,F2
        S.D    F4,-24(R1)
        DADDUI R1,R1,#-32
        BNE    R1,R2,Loop

18 Tradeoffs
- Fewer ADD/BR instructions: 1/n for n unrolls
- Code expansion: nearly n times for n unrolls
- Prologue/epilogue: most loops aren't 0 mod n iterations, so a startup (or finish-up) loop of single iterations must handle the 0 to n-1 leftovers
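In C, the unroll-by-4 transformation, including the cleanup loop for leftover iterations, might look like this sketch (function name and the add-a-scalar body are illustrative, chosen to mirror the FP loop above):

```c
/* Unroll-by-4 sketch: four array elements per loop-control operation,
   plus a cleanup loop for the n mod 4 leftover iterations. */
void add_scalar_unrolled(double *x, double s, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {  /* main body: 4 units of "work" per branch */
        x[i]     += s;
        x[i + 1] += s;
        x[i + 2] += s;
        x[i + 3] += s;
    }
    for (; i < n; i++)            /* cleanup: handles the 0 to 3 leftovers */
        x[i] += s;
}
```

For n = 7 the main body runs once (elements 0-3) and the cleanup loop handles elements 4-6.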

19 Unrolled loop, dependencies minimized
  Loop: L.D    F0,0(R1)
        ADD.D  F4,F0,F2
        S.D    F4,0(R1)
        L.D    F6,-8(R1)
        ADD.D  F8,F6,F2
        S.D    F8,-8(R1)
        L.D    F10,-16(R1)
        ADD.D  F12,F10,F2
        S.D    F12,-16(R1)
        L.D    F14,-24(R1)
        ADD.D  F16,F14,F2
        S.D    F16,-24(R1)
        DADDUI R1,R1,#-32
        BNE    R1,R2,Loop
This is what register renaming does in hardware.

20 Unrolled loop, reordered
  Loop: L.D    F0,0(R1)
        L.D    F6,-8(R1)
        L.D    F10,-16(R1)
        L.D    F14,-24(R1)
        ADD.D  F4,F0,F2
        ADD.D  F8,F6,F2
        ADD.D  F12,F10,F2
        ADD.D  F16,F14,F2
        S.D    F4,0(R1)
        S.D    F8,-8(R1)
        DADDUI R1,R1,#-32
        S.D    F12,16(R1)
        BNE    R1,R2,Loop
        S.D    F16,8(R1)
This is what dynamic scheduling does in hardware.

21 Limitations on software rescheduling
- Register pressure: run out of architectural registers (why Itanium has 128 registers)
- Data path pressure: how many parallel loads?
- Interaction with issue hardware: can hardware see far enough ahead to notice? (one reason there are different compilers for each processor implementation)
- Compiler sees dependencies; hardware sees hazards

22 Superscalar Issue
        Integer pipeline          FP pipeline
  Loop: L.D    F0,0(R1)
        L.D    F6,-8(R1)
        L.D    F10,-16(R1)        ADD.D  F4,F0,F2
        L.D    F14,-24(R1)        ADD.D  F8,F6,F2
        L.D    F18,-32(R1)        ADD.D  F12,F10,F2
        S.D    F4,0(R1)           ADD.D  F16,F14,F2
        S.D    F8,-8(R1)          ADD.D  F20,F18,F2
        S.D    F12,-16(R1)
        DADDUI R1,R1,#-40
        S.D    F16,16(R1)
        BNE    R1,R2,Loop
        S.D    F20,8(R1)
Can the compiler tell the hardware to issue in parallel?

23 Limit on loop unrolling
- Keeping the pipeline filled: each iteration is load-execute-store, so the pipeline has to wait – at least at the start of the loop
- Load delays
- Dependencies
- Code explosion
- Cache effects

24 Loop dependencies
  for (i = 0; i < 1000; i++) {
      A[i] = B[i];
  }
Embarrassingly parallel.
  for (i = 0; i < 1000; i++) {
      A[i] = B[i] + x;
      B[i+1] = A[i] + B[i];
  }
Iteration i depends on i-1.
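The contrast can be made concrete (array sizes and initial values below are illustrative): the first loop gives the same answer in any iteration order, while the second's recurrence through B forces sequential execution:

```c
/* Loop-carried dependence: B[i+1] is written in iteration i and read
   in iteration i+1, so the iterations must execute in order. */
void dependent_loop(int *A, int *B, int x, int n) {
    for (int i = 0; i < n; i++) {
        A[i] = B[i] + x;
        B[i + 1] = A[i] + B[i];   /* feeds the next iteration */
    }
}
/* With B[0] = 1 and x = 1, B follows the recurrence B[i+1] = 2*B[i] + 1:
   B = 1, 3, 7, 15, 31, ... — each value depends on the previous one. */
```

No unrolling or reordering can break this chain; only algorithmic transformation (e.g., solving the recurrence in closed form) removes the serialization.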

25 Software pipelining
- Separate program loop from data loop
- Each iteration has different phases of execution
- Each iteration of the original loop occurs in 2 or more iterations of the pipelined loop
- Analogous to Tomasulo: reservation stations hold different iterations of the loop

26 Software pipelining
[Diagram: iterations 0-3 of the original loop overlap in time; each pipelined iteration combines phases from several consecutive original iterations.]

27 Compiling for Software Pipeline
Original:
  for (i = 0; i < 1000; i++) {
      x = load(a[i]);
      y = calculate(x);
      b[i] = y;
  }
Pipelined:
  x = load(a[0]);          /* prologue: start iterations 0 and 1 */
  y = calculate(x);
  x = load(a[1]);
  for (i = 1; i < 999; i++) {
      b[i-1] = y;
      y = calculate(x);
      x = load(a[i+1]);
  }
  b[998] = y;              /* epilogue: drain the last two iterations */
  b[999] = calculate(x);
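A self-contained, runnable version of the transformation on this slide (the small N and the stand-in calculate are illustrative assumptions):

```c
#define N 16

static int calculate(int x) { return 2 * x + 1; }  /* stand-in for real work */

/* Software-pipelined form of: for(i){ x = a[i]; y = calculate(x); b[i] = y; }
   Each steady-state trip touches three different original iterations:
   store for i-1, calculate for i, load for i+1. */
void pipelined_loop(const int *a, int *b) {
    int x = a[0];                 /* prologue: start iterations 0 and 1 */
    int y = calculate(x);
    x = a[1];
    for (int i = 1; i < N - 1; i++) {
        b[i - 1] = y;             /* finish iteration i-1 */
        y = calculate(x);         /* middle of iteration i */
        x = a[i + 1];             /* start iteration i+1   */
    }
    b[N - 2] = y;                 /* epilogue: drain the last two iterations */
    b[N - 1] = calculate(x);
}
```

The result is identical to the straightforward loop, but within each steady-state iteration the load, the computation, and the store are independent of one another and can issue back-to-back.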

28 Unrolling vs. Software pipelining
- Loop unrolling minimizes branch overhead
- Software pipelining minimizes dependencies
  - Especially useful for long-latency loads/stores
  - No idle cycles within iterations
- Can combine unrolling with software pipelining
- Both techniques are standard practice in high performance compilers

29 Static Branch Prediction
- Some of the benefits of dynamic techniques
- Delayed branch
- Compiler guesses: for(i=0;i<10000;i++) is pretty easy
- Compiler hints: some instruction sets include hints
- Especially useful for trace-based optimization
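As one concrete form of compiler hint, GCC and Clang expose __builtin_expect; the sketch below uses it (the likely/unlikely macro names are a common convention, not part of the compiler):

```c
/* GCC/Clang static branch hint: __builtin_expect tells the compiler
   which way a branch usually goes, so the likely path can be laid out
   as the fall-through (and hint bits set on ISAs that have them). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int sum_nonnegative(const int *v, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        if (unlikely(v[i] < 0))   /* hint: negative values are rare */
            continue;
        s += v[i];
    }
    return s;
}
```

The hint changes code layout, not semantics: the function returns the same sum with or without it.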

30 More branch troubles
- Even unrolled, loops still have some branches
- Taken branches can break the pipeline – nonsequential fetch
- Mispredicted branches can break the pipeline
- Especially annoying for short branches:
    a = f();
    b = g();
    if (a > 0)
        x = a;
    else
        x = b;

31 Predication
- Add a predicate to an instruction: the instruction only executes if the predicate is true
- Several approaches: flags, registers
- On every instruction, or just a few (e.g., predicated load)
- Used in several architectures: HP Precision, ARM, IA-64
    LD    R2, b
    LD    R1, a
    ifGT  MOV R4, R1
    ifLE  MOV R4, R2
    ...
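Predication can be mimicked in portable C with a mask acting as the predicate register; both arms are computed and the predicate selects which result survives (a sketch of the idea, not any particular ISA's semantics):

```c
#include <stdint.h>

/* Branchless equivalent of: x = (a > 0) ? a : b;
   The mask plays the role of a predicate register: both "arms"
   execute, and the predicate picks which result is kept. */
int32_t select_if_positive(int32_t a, int32_t b) {
    int32_t pred = -(int32_t)(a > 0);   /* all ones if a > 0, else all zeros */
    return (a & pred) | (b & ~pred);
}
```

Like hardware predication, this trades a hard-to-predict branch for unconditionally executing both paths, which pays off when the branch would mispredict often.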

32 Cost/Benefit of Predication
- Enables longer basic blocks: much easier for compilers to find parallelism
- Can eliminate instructions: branch folded into instruction
- Instruction space pressure: how to indicate predication?
- Register pressure: may need extra registers for predicates
- Stalls can still occur: still have to evaluate the predicate and forward it to the predicated instruction