Lecture 1: Introduction Instruction Level Parallelism & Processor Architectures.

2 Instruction Level Parallelism (ILP)
Simultaneous execution of multiple instructions.

    do {
        Swap = 0;
        for (I = 0; I < Last; I++) {
            if (Tab[I] > Tab[I+1]) {
                Temp = Tab[I];
                Tab[I] = Tab[I+1];
                Tab[I+1] = Temp;
                Swap = 1;
            }
        }
    } while (Swap);

3 Barriers to Detecting ILP
- Control dependences: arise due to conditional branches.
- Data dependences:
  - Register dependences
  - Memory dependences

4 Branches

    j = 0;
    *q = false;
    while ((*q == false) && (j != 8)) {
        j = j + 1;
        *q = false;
        if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true)) {
            x[i] = j;
            b[j] = false;
            a[i+j] = false;
            c[i-j+7] = false;
            if ( ....
        }
    }

[Slide annotation: the compiled code contains a separate branch for each test: the while condition, if (b[j]), if (a[i+j]), and if (c[i-j+7]), before reaching x[i] = j;]

5 Frequent Branches
A sequence of branch instructions in the dynamic stream separated by at most one non-branch instruction.
[Chart: dynamic branches, in %]

6 Branch Prediction
Accuracy of gshare.
[Chart: prediction accuracy, in %]
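The gshare predictor the slide measures XORs the branch PC with a global history register to index a table of 2-bit saturating counters. A minimal sketch, where the table size, the 12-bit history length, and the function names are illustrative assumptions rather than details from the slides:

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_BITS 12
#define TABLE_SIZE (1u << TABLE_BITS)

static uint8_t counters[TABLE_SIZE]; /* 2-bit saturating counters, 0..3, start at 0 */
static uint32_t history;             /* global branch-history shift register */

/* Predict taken when the counter is in one of the two "taken" states (2 or 3). */
int gshare_predict(uint32_t pc) {
    uint32_t idx = (pc ^ history) & (TABLE_SIZE - 1);
    return counters[idx] >= 2;
}

/* Train the counter toward the actual outcome, then shift the outcome
 * into the global history. */
void gshare_update(uint32_t pc, int taken) {
    uint32_t idx = (pc ^ history) & (TABLE_SIZE - 1);
    if (taken && counters[idx] < 3) counters[idx]++;
    if (!taken && counters[idx] > 0) counters[idx]--;
    history = ((history << 1) | (taken ? 1u : 0u)) & (TABLE_SIZE - 1);
}
```

After a branch behaves consistently for a while, the counter its (pc, history) pair maps to saturates and the prediction follows the observed outcome.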

7 Memory Dependences
Reordering of memory instructions (loads and stores) is not always possible.

Example 1 (the load may pass the store only if addr != addr'):

    Before:              After:
    Store R1, addr       Load  R2, addr'
    Load  R2, addr'      Store R1, addr
    Add   R1, R2         Add   R1, R2

Example 2 (the store/load pair may pass Store R5 only if addr != addr'):

    Before:              After:
    Store R5, addr       Store R2, addr'
    Store R2, addr'      Load  R1, addr'
    Load  R1, addr'      Store R5, addr
    Add   R1, R3         Add   R1, R3

8 Memory Disambiguation

9 Value-Based Store-Set Disambiguator
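A store-set disambiguator learns from past memory-order violations which loads depend on which stores, and makes later dynamic instances of such a load wait for its store. A heavily simplified sketch: the table size, the direct-mapped PC indexing, and the use of a fresh set id per violation (real store sets merge sets by keeping the minimum id) are all illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>

#define SSIT_SIZE 256          /* Store Set ID Table: PC -> store-set id */
#define NO_SET 0xFFFFFFFFu

static uint32_t ssit[SSIT_SIZE];
static uint32_t next_set_id;

void storeset_init(void) {
    for (int i = 0; i < SSIT_SIZE; i++) ssit[i] = NO_SET;
    next_set_id = 0;
}

/* Called when a load was (wrongly) executed before an older store to the
 * same address: place both instructions in a common store set. */
void storeset_violation(uint32_t load_pc, uint32_t store_pc) {
    uint32_t id = next_set_id++;
    ssit[load_pc % SSIT_SIZE] = id;
    ssit[store_pc % SSIT_SIZE] = id;
}

/* A load must wait if it shares a store set with a still-in-flight store;
 * loads with no recorded set run ahead speculatively. */
int storeset_must_wait(uint32_t load_pc, uint32_t inflight_store_pc) {
    uint32_t id = ssit[load_pc % SSIT_SIZE];
    return id != NO_SET && id == ssit[inflight_store_pc % SSIT_SIZE];
}
```

The design choice is that disambiguation is predictive: a load is delayed only after it has actually conflicted once, so independent loads keep their full reordering freedom.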

10 Register Dependences
True data dependences vs. false data dependences.

    Original:            After renaming:
    Add  R2, R3          Add  R2, R3
    Load R2, .           Load R2, .
    Add  R1, R2          Add  R1, R2
    Load R1, ..          Load R4, ..
    Sub  R1, R2          Sub  R4, R2
    Load R1, .           Load R1, .
    Load R3, .           Load R3, .

Renaming the second write to R1 (and its use in Sub) to R4 removes the false (WAW/WAR) dependences while preserving the true ones.
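False register dependences can be removed by register renaming: every write to an architectural register is given a fresh physical register, while reads look up the most recent mapping. A minimal sketch of the map table; the register counts, the bump-allocator free list (real cores recycle physical registers at retirement), and the three-operand format are illustrative assumptions:

```c
#include <assert.h>

#define NUM_ARCH 8
#define NUM_PHYS 32

static int map_table[NUM_ARCH]; /* architectural -> physical register */
static int next_free;           /* next free physical register */

void rename_init(void) {
    for (int r = 0; r < NUM_ARCH; r++) map_table[r] = r; /* identity at start */
    next_free = NUM_ARCH;
}

/* Rename one instruction "dst = op(src1, src2)": sources read the current
 * mapping (preserving true dependences), the destination gets a fresh
 * physical register (eliminating WAW and WAR hazards on dst). */
void rename_inst(int dst, int src1, int src2,
                 int *pdst, int *psrc1, int *psrc2) {
    *psrc1 = map_table[src1];
    *psrc2 = map_table[src2];
    *pdst = next_free++;
    map_table[dst] = *pdst;
}
```

This is exactly the effect shown on the slide: the second write to R1 lands in a different register (R4 there), so the two writes no longer conflict.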

11 Window Size vs ILP (issue width = 16)

12 Parallelism Study: ILP in SPEC95

13 Conclusions
- There is an ample amount of parallelism available to scale the issue width.
- Very large instruction windows must be implemented.
- A highly accurate memory disambiguation mechanism is required.
- Highly accurate branch prediction must be performed.
- Register dependences should be avoided.

14 Processors
- Pipelined
- Advanced pipelining
- Superscalars
- Very Long Instruction Word (VLIW)
- Multiprocessors/multicores

15 Pipelined Processors
In-order, overlapped execution of instructions.
E.g., a 5-stage pipeline: instruction fetch (F), decode and register operand fetch (D), execute (E), memory operand fetch (M), and write-back of results (WB).
[Diagram: successive instructions overlapped in the F D E M WB stages]
The MIPS R4000 has an 8-stage pipeline.
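The timing benefit of overlap can be stated in one formula: on a k-stage pipeline issuing one instruction per cycle, n instructions finish in k + (n - 1) cycles, plus one cycle per stall bubble. A minimal sketch of that model (the function name is mine; it deliberately ignores hazard detection itself and just counts cycles):

```c
#include <assert.h>

/* Cycles to run n instructions on a k-stage pipeline: the first instruction
 * takes k cycles, each later one retires 1 cycle after its predecessor,
 * and each stall bubble adds one cycle. */
long pipeline_cycles(long k, long n, long stalls) {
    if (n == 0) return 0;
    return k + (n - 1) + stalls;
}
```

For large n without stalls this approaches one instruction per cycle regardless of k, which is why deeper pipelines (like the R4000's 8 stages) still sustain the same ideal throughput.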

16 Causes of Pipeline Delays
- Data dependences (RAW hazards) → mitigated by register bypass and by code reordering in the compiler.
- Register hazards → WAW hazards: instructions may reach the WB stage out of order. No WAR hazards.
- Branch delays → the compiler fills branch delay slots vs. the hardware performs branch prediction.
- Structural hazards due to non-pipelined units.
- Register writes, when multiple instructions reach the WB stage at the same time (issue vs. retire rate).

17 Advanced Pipelining
In-order issue but out-of-order execution.

    DIVD F0, F2, F4
    ADDD F10, F0, F8
    SUBD F8, F8, F14

Execute SUBD before ADDD.
Dynamic scheduling: Scoreboard, Tomasulo's algorithm.
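The core idea of dynamic scheduling is that an instruction may begin executing as soon as its source operands are ready, regardless of program order. A minimal readiness check over the slide's three-instruction example; the busy-bit model is an illustrative simplification (a real scoreboard also stalls on WAR/WAW hazards, and Tomasulo's algorithm renames them away):

```c
#include <assert.h>

#define NUM_REGS 32

static int busy[NUM_REGS]; /* 1 while an in-flight instruction will still write this register */

/* An instruction may start executing once no pending write targets its sources. */
int can_issue(int src1, int src2) {
    return !busy[src1] && !busy[src2];
}

void start_write(int dst)  { busy[dst] = 1; }
void finish_write(int dst) { busy[dst] = 0; }
```

Walking the slide's sequence through this check shows why SUBD overtakes ADDD: ADDD waits on F0 from the long-latency DIVD, while SUBD's sources F8 and F14 are already available.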

18 Superscalar Processors
Multiple instructions can be issued in each cycle.
Speculative execution is incorporated (results are committed or discarded).
[Diagram: several instructions flowing through the F D E M WB stages in parallel]
The AMD K7 is a 9-issue superscalar; the PowerPC is a 4-issue superscalar.

19 VLIW
Each long instruction contains multiple operations that are executed in parallel.
The compiler performs speculation and recovery.
[Diagram: a single fetch/decode feeding multiple execution units in parallel]
- The Multiflow 500 can issue up to 28 operations in each instruction (instructions can be up to 1024 bits).
- Itanium: 128-bit instruction bundle with three 41-bit operations and a 5-bit template.
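Since the compiler, not the hardware, decides which operations share a long instruction, VLIW scheduling can be sketched as a packing problem. A toy greedy scheduler: the 3-slot width, the 64-register limit, and the rule that an operation may not share a word with one that writes any register it touches are all illustrative assumptions, far simpler than real trace scheduling:

```c
#include <assert.h>

#define WIDTH 3  /* operations per long instruction */

typedef struct { int dst, src1, src2; } Op;

/* Greedily pack operations in program order: an op joins the current long
 * instruction unless the word is full or a register it reads or writes was
 * written earlier in the same word. Registers must be in 0..63.
 * Returns the number of long instructions emitted. */
int vliw_pack(const Op *ops, int n) {
    int words = 0, slots = 0;
    int written[64] = {0};
    for (int i = 0; i < n; i++) {
        int conflict = written[ops[i].src1] || written[ops[i].src2]
                    || written[ops[i].dst];
        if (slots == 0 || slots == WIDTH || conflict) {
            words++;        /* start a new long instruction */
            slots = 0;
            for (int r = 0; r < 64; r++) written[r] = 0;
        }
        written[ops[i].dst] = 1;
        slots++;
    }
    return words;
}
```

Independent operations compress into few words, while a chain of true dependences degenerates to one operation per word, which is exactly why the trace-based techniques on the next slides matter.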

20 Control Dependences: The Instruction Window
Superscalar:
- Hardware branch prediction guides instruction fetch to fill up the processor's instruction window.
- Instructions are issued from the window as they become ready, that is, out-of-order execution is possible.
VLIW:
- Programs are first profiled; the compiler uses the profiles to trace out likely paths. A trace is a software instruction window.
- Instruction reordering is performed by the compiler within the trace.

21 Data Dependences: Exploiting ILP
Superscalar:
- Memory dependences: hardware load/store disambiguation techniques are used to enable out-of-order execution.
- False register dependences: avoided using register renaming.
- True data dependences: must be honored; value prediction allows out-of-order execution of dependent instructions.
VLIW:
- Memory dependences: detected by the compiler using dependence analysis or address profiling.
- False data dependences: avoided by the compiler through renaming (memory) and register allocation.
- True data dependences: strictly followed; reordering is possible with hardware support.