Embedded Computer Architectures. Hennessy & Patterson, Chapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation. Gerard Smit (Zilverling 4102), André Kokkeler (Zilverling 4096)

Contents
Introduction
Hazards caused by dependencies
Instruction Level Parallelism; Tomasulo’s approach
Branch prediction

Dependencies
True data dependency
Name dependency
—Antidependency
—Output dependency
Control dependency

Data Dependency
(Figure: a chain Inst i → Inst i+1 → Inst i+2, the result of each instruction feeding the next.)
Two instructions are data dependent => risk of a RAW hazard

Name Dependency
Antidependence: instruction i reads a register or memory location that a later instruction j writes. Two instructions are antidependent => risk of a WAR hazard.
Output dependence: instructions i and j both write the same register or memory location. Two instructions are output dependent => risk of a WAW hazard.
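The three hazard risks above can be checked mechanically from each instruction's read and write sets. A small illustrative sketch (the `hazard` helper and the example register sets are our own, not from the slides):

```python
# Classify the hazard risk between an earlier instruction i and a later
# instruction j, given the sets of registers each one reads and writes.
# (Illustrative helper, not part of the slides.)
def hazard(i_reads, i_writes, j_reads, j_writes):
    risks = []
    if i_writes & j_reads:
        risks.append("RAW")   # true data dependence: j reads what i writes
    if i_reads & j_writes:
        risks.append("WAR")   # antidependence: j overwrites what i reads
    if i_writes & j_writes:
        risks.append("WAW")   # output dependence: both write the same place
    return risks

# i: ADD.D F2, F0, F8 (writes F2)   j: L.D F2, 0(R2) (also writes F2)
print(hazard({"F0", "F8"}, {"F2"}, {"R2"}, {"F2"}))   # ['WAW']
```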

Control Dependency
The branch condition determines whether instruction i is executed => i is control dependent on the branch

Instruction Level Parallelism
Pipelining = ILP
Other approach: dynamic scheduling => out-of-order execution
The Instruction Decode stage is split into
—Issue (decode, check for structural hazards)
—Read Operands

Instruction Level Parallelism
Scoreboard: an instruction may proceed when
—sufficient resources are available
—it has no data dependencies
Tomasulo’s approach
—Minimize RAW stalls (start execution as soon as operands are available)
—Register renaming to minimize WAW and WAR hazards
—Reservation stations between issue and read operands (park instructions while they wait for operands)

Tomasulo’s approach: Register Renaming
Example sequence on register F0: 1. Read F0  2. Write F0  3. Read F0  4. Write F0
(Figure: for each instruction, an arrow runs over time from the start of the instruction to its use of the register.)

Tomasulo’s approach: Register Renaming
Same sequence: problems arise if the arrows cross.

Tomasulo’s approach: Register Renaming
Instr 2, 3, … will be stalled. Note that Instr 2 and 3 are stalled only because Instr 1 is not ready; if not for Instr 1, they could be executed earlier.

Tomasulo’s approach: Register Renaming
With renaming, Instr 1 and Instr 3 each get their own copy of the register (Instr 1.Register F0 and Instr 3.Register F0). How is it arranged that the value is written into Instr 3.Register F0 and not into Instr 1.Register F0?

Tomasulo’s approach: Register Renaming
The result of Instr 2 is labelled with ‘Instr. 2’. Hardware checks whether an instruction is waiting for that result (by checking the F0Source fields of waiting instructions) and places the result in the correct place.

Tomasulo’s approach: Register Renaming
Instr 3’s copy of the register consists of a value field (F0Data) and a source tag (F0Source, here ‘Instr. 2’), together with the read operation.

Tomasulo’s approach: Register Renaming
Each waiting read operation carries its own F0Data / F0Source pair.

Tomasulo’s approach: Register Renaming
The F0Data / F0Source fields plus the operation (read or write) form an entry in the Reservation Station. The source fields are filled during issue; the data fields are filled during execution, as results arrive.

Tomasulo’s approach: Effects
—Register renaming prevents WAW and WAR hazards
—Execution starts as soon as operands are available (the data fields are filled): minimizes RAW stalls
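The renaming effect can be sketched in a few lines of Python (a simplified model with unlimited "physical registers" p0, p1, …; the instruction tuples are our own example, in the spirit of the slides):

```python
# Register renaming sketch: every write allocates a fresh physical
# register, so two writes to F0 no longer conflict (no WAW), and an
# earlier read keeps the tag of its own producer (no WAR).
def rename(program):
    """program: list of (dest, src1, src2, ...) architectural registers."""
    latest = {}                          # architectural name -> current tag
    counter = iter(range(10**6))
    renamed = []
    for dest, *srcs in program:
        src_tags = [latest.get(s, s) for s in srcs]  # read newest producer
        tag = f"p{next(counter)}"                    # fresh physical register
        latest[dest] = tag
        renamed.append((tag, *src_tags))
    return renamed

# 1: F4 = F0 op F2   2: F0 = F4 op F6   3: F8 = F0 op F2   4: F0 = F8 op F6
prog = [("F4", "F0", "F2"), ("F0", "F4", "F6"),
        ("F8", "F0", "F2"), ("F0", "F8", "F6")]
print(rename(prog))
# instructions 2 and 4 now write different registers (p1 and p3): WAW gone
```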

Tomasulo’s approach: Issue in more detail (issue is done sequentially)
Sequence: 1. Read F0  2. Write F0  3. Read F0  4. Write F0
Reservation station entries have the format: data, label, operation, source. After issuing the first read, its data field is still empty and its source is unknown (?????); then write1 is issued, and so on.
This is the only information you have: during issue, you have to keep track of which instruction changed F0 last!

Tomasulo’s approach: Issue in more detail
A register status table records, for every register, the label of the last instruction that writes it (for F0: first write1, then write2). Keeping track of register status during issue is done for every register.

Tomasulo’s approach: Definitions for the MIPS
For each reservation station:
—Name: the label of the station
—Busy: in execution or not
—Operation: the instruction to perform
—Vj, Vk: operand values
—Qj, Qk: operand sources
—A: memory address (Load, Store)
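The fields listed above map directly onto a record per station; a minimal Python sketch (the `ready` check is our own addition, reflecting the rule that execution may start once both Q fields are empty):

```python
from dataclasses import dataclass
from typing import Optional

# The reservation-station fields named on the slide.
@dataclass
class ReservationStation:
    name: str                    # label of this station
    busy: bool = False           # in execution or not
    op: Optional[str] = None     # operation to perform
    Vj: Optional[float] = None   # operand value j (if already available)
    Vk: Optional[float] = None   # operand value k
    Qj: Optional[str] = None     # station producing operand j (tag)
    Qk: Optional[str] = None     # station producing operand k (tag)
    A: Optional[int] = None      # memory address (Load, Store)

    def ready(self) -> bool:
        # may start executing once both operands are values, not tags
        return self.busy and self.Qj is None and self.Qk is None

rs = ReservationStation("Add1", busy=True, op="ADD.D", Vj=1.0, Qk="Mult1")
print(rs.ready())                # False: still waiting for Mult1's result
rs.Vk, rs.Qk = 2.0, None         # Mult1's result arrives
print(rs.ready())                # True
```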

Tomasulo’s approach; hardware view
—Issue hardware (fed from the instruction queue): performs register renaming and fills the reservation stations
—Reservation stations: hold issued instructions and their operands
—Execution control hardware: of which instructions are operands and corresponding execution units available? => transports operands to the execution units
—Execution units: deliver results plus the identification of the instruction producing the result
—Common Data Bus: broadcasts the results; the reservation fill hardware puts each result in the correct place in the reservation stations

Branch prediction
Data hazards => Tomasulo’s approach
Branch (control) hazards => branch prediction
—Goal: resolve the outcome of a branch early => prevent stalls caused by control hazards

Branch prediction; 1 history bit
Example:
Outerloop: …
           R = 10
Innerloop: …
           R = R - 1
           BNZ R, Innerloop
           …
           Branch Outerloop
History bit: was the branch taken previously or not?
—predict taken: fetch from ‘Innerloop’
—predict not taken: fetch the next instruction
Actual outcome of the branch:
—taken: set the history bit to ‘taken’
—not taken: set the history bit to ‘not taken’
In this situation: correct prediction in 80% of branch evaluations
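The 80% figure can be reproduced by simulating the predictor on the loop's branch outcomes (a small sketch of our own; the outcome pattern, nine taken then one not taken, follows the R = 10 loop above):

```python
# One history bit: predict whatever the branch did last time.
def one_bit_accuracy(outcomes):
    predict_taken = False       # initial prediction: not taken
    correct = 0
    for taken in outcomes:
        if predict_taken == taken:
            correct += 1
        predict_taken = taken   # remember the last actual outcome
    return correct / len(outcomes)

# BNZ R, Innerloop: taken 9 times, then not taken once, per outer pass
outcomes = ([True] * 9 + [False]) * 100
print(one_bit_accuracy(outcomes))   # 0.8: two mispredictions per 10 runs
```

The two mispredictions per pass are the loop exit and the first iteration of the next pass, exactly the weakness the 2-bit scheme below removes.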

Branch prediction; 2 history bits
Example: the same loop (R = 10).
The 2 history bits form a small state machine with ‘predict taken’ and ‘predict not taken’ states; a single wrong outcome moves between neighbouring states, so two consecutive wrong outcomes are needed to flip the prediction.
In this application: correct prediction in 90% of branch evaluations
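Extending the same simulation to a 2-bit saturating counter reproduces the 90% figure (states 0–1 predict not taken, 2–3 predict taken; a minimal sketch of our own):

```python
# Two history bits as a saturating counter: one wrong outcome nudges the
# state, two consecutive wrong outcomes are needed to flip the prediction.
def two_bit_accuracy(outcomes):
    state = 3                   # 0,1: predict not taken; 2,3: predict taken
    correct = 0
    for taken in outcomes:
        if (state >= 2) == taken:
            correct += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

outcomes = ([True] * 9 + [False]) * 100
print(two_bit_accuracy(outcomes))   # 0.9: only the loop exit mispredicts
```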

Branch prediction; Correlating branch predictors
if (aa == 2) aa = 0;
if (bb == 2) bb = 0;
if (aa != bb) …
The results of the first two branches are used in the prediction of the last branch.
Example: suppose aa == 2 and bb == 2; then the condition of the last ‘if’ is always false => if the previous two branches are not taken, the last branch is taken.

Branch prediction; Correlating branch predictors
Mechanism: suppose the results of the 3 previous branches are used to influence the decision. 8 possible history sequences:
br-3  br-2  br-1 | br (branch under consideration)
NT    NT    NT   | T
NT    NT    T    | NT
…     …     …    | …
T     T     T    | T
A prediction is kept per sequence and is updated depending on the outcome of the branch under consideration:
—1 history bit per sequence: a (3,1) predictor
For the sequence (NT NT NT) the prediction is that the branch will be taken => fetch from the branch destination.

Branch prediction; Correlating branch predictors
With 2 history bits per sequence we get a (3,2) predictor: the prediction for each sequence is represented by 2 bits, 2 combinations indicating ‘predict taken’ and 2 indicating ‘predict not taken’, updated by means of a state machine.
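An (m, n) predictor of this kind can be sketched as a table of 2^m n-bit counters indexed by the global history (a simplified, single-branch model of our own; real predictors also index by the branch address):

```python
# (m, n) correlating predictor: the last m branch outcomes select one of
# 2**m n-bit saturating counters.
class CorrelatingPredictor:
    def __init__(self, m, n):
        self.m, self.n = m, n
        self.history = 0                      # last m outcomes, as bits
        self.counters = [0] * (2 ** m)        # one counter per history pattern

    def predict(self):
        return self.counters[self.history] >= 2 ** (self.n - 1)

    def update(self, taken):
        top = 2 ** self.n - 1
        c = self.counters[self.history]
        self.counters[self.history] = min(c + 1, top) if taken else max(c - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & (2 ** self.m - 1)

# A (3, 2) predictor, as on the slide: after a run of taken branches the
# all-taken history pattern has learnt to predict taken.
p = CorrelatingPredictor(3, 2)
for _ in range(10):
    p.update(True)
print(p.predict())   # True
```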

Branch Target Buffer
Solutions:
—Delayed branch
—Branch target buffer
Even with a good prediction, we don’t know where to branch to until the target address has been computed, and by then we have already fetched the next instruction.

Branch Target Buffer
The buffer is a small memory, indexed with the Program Counter in parallel with the instruction cache. It holds the addresses of branch instructions and the corresponding branch targets. On a hit (confirmed by the instruction decode hardware), the branch target is selected => after the IF stage, the branch address is already in the PC.
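Functionally, the buffer behaves like a lookup table from fetch address to predicted target; a Python sketch (the addresses are our own example values; real BTBs are set-associative hardware tables):

```python
# Branch target buffer: keyed by the address of the branch instruction,
# consulted with the PC during instruction fetch.
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}              # branch address -> branch target

    def lookup(self, pc):
        # hit: the target is known before the instruction is even decoded
        return self.entries.get(pc)

    def record(self, pc, target):
        self.entries[pc] = target

btb = BranchTargetBuffer()
btb.record(0x40, 0x10)                 # a branch at 0x40 targets 0x10
print(hex(btb.lookup(0x40)))           # 0x10: fetch redirected after IF
print(btb.lookup(0x44))                # None: no branch known, fetch falls through
```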

Branch Folding
Instead of the branch target address, the buffer stores the instruction at the branch target. For unconditional branches this effectively removes the branch instruction (a penalty of -1).

Return Address Predictors
Indirect branches: the branch address is known only at run time. 80% of the time these are return instructions. Solution: a small, fast stack; a procedure call pushes the return address, a procedure return (RET) pops it.
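The return-address stack can be sketched in a few lines (depth 8 is our own choice; real predictors use a similarly small fixed depth):

```python
# Return address predictor: push on procedure call, pop on return (RET).
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.stack, self.depth = [], depth

    def call(self, return_address):
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # the oldest entry is lost when full
        self.stack.append(return_address)

    def ret(self):
        # predicted target of the return instruction
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.call(0x104)                        # outer call, return address 0x104
ras.call(0x208)                        # nested call, return address 0x208
print(hex(ras.ret()))                  # 0x208: innermost return first
print(hex(ras.ret()))                  # 0x104
```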

Multiple Issue Processors
Goal: issue multiple instructions in a clock cycle
Superscalar: issues a varying number of instructions per clock
—Statically scheduled
—Dynamically scheduled
VLIW: issues a fixed number of instructions per clock
—Statically scheduled

Multiple Issue Processors: Example
One integer and one FP instruction are issued together each cycle; each pair starts one cycle after the previous pair:

Instruction type   Pipe stages
Integer            IF ID EX MEM WB
FP                 IF ID EX     WB
Integer               IF ID EX MEM WB
FP                    IF ID EX     WB
Integer                  IF ID EX MEM WB
FP                       IF ID EX     WB
Integer                     IF ID EX MEM WB
FP                          IF ID EX

Hardware Based Speculation
Multiple issue processors => nearly 1 branch every clock cycle
Dynamic scheduling + branch prediction: speculative fetch + issue
Dynamic scheduling + branch speculation: speculative fetch + issue + execution
KEY: do not perform updates that cannot be undone until you are sure the corresponding operation really should be executed.

Hardware Based Speculation
Tomasulo with a branch (predict not taken): operations before the branch (operation i) are finished and have updated the register file. Operation k has been issued beyond the branch; its operand is available, but execution is postponed until it is clear whether the branch is taken.

Hardware Based Speculation
Once the branch outcome is known, depending on that outcome either:
—the reservation stations are flushed (misprediction), or
—execution of operation k starts.

Hardware Based Speculation
With speculation: operation k (operand available) is executed immediately, but its result is kept in the reorder buffer. Results of operations before the branch point are committed, sequentially, from the reorder buffer to the register file.
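The commit rule, results may finish out of order but update the register file only from the head of the buffer in program order, can be sketched as follows (a simplified model; the entry fields are our own):

```python
from collections import deque

# Reorder buffer: entries are allocated at issue in program order;
# results may arrive out of order, but commit only happens at the head.
class ReorderBuffer:
    def __init__(self):
        self.buf = deque()

    def issue(self, dest):
        entry = {"dest": dest, "value": None, "done": False}
        self.buf.append(entry)
        return entry

    def commit(self, regfile):
        # sequential commit: stop at the first unfinished entry
        while self.buf and self.buf[0]["done"]:
            e = self.buf.popleft()
            regfile[e["dest"]] = e["value"]

rob, regs = ReorderBuffer(), {}
op_i = rob.issue("F0")                   # operation i (before the branch)
op_k = rob.issue("F2")                   # operation k (speculative)
op_k["value"], op_k["done"] = 2.0, True  # k finishes first...
rob.commit(regs)
print(regs)                              # {}: i not done, k must wait
op_i["value"], op_i["done"] = 1.0, True
rob.commit(regs)
print(regs)                              # {'F0': 1.0, 'F2': 2.0}
```

On a misprediction, the speculative tail of the buffer would simply be discarded before it ever reaches the register file, which is what makes the scheme undoable.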

Hardware Based Speculation
As older operations finish, the committed region of the reorder buffer advances sequentially towards operation k.


Hardware Based Speculation
Some aspects:
—Speculative instructions that cause a lot of work may turn out not to be needed => restrict the actions allowed in speculative mode
—The ILP of a program is limited
—Realistic branch predictors: easier to implement => less efficient

Pentium Pro Implementation
Pentium Family

Processor          Year  Clock Rate (MHz)  L1 Cache (instr, data)  L2 Cache
Pentium Pro        …     …                 … KB, 8 KB              256 KB, 1024 KB
Pentium II         …     …                 … KB, 16 KB             256 KB, 512 KB
Pentium II Xeon    …     …                 … KB, 16 KB             512 KB, 2 MB
Celeron            …     …                 … KB, 16 KB             128 KB
Pentium III        …     …                 … KB, 16 KB             256 KB, 512 KB
Pentium III Xeon   …     …                 … KB, 16 KB             1 MB, 2 MB

Pentium Pro Implementation
i486: CISC => problems with pipelining
Two observations:
—CISC instructions are translated into sequences of microinstructions
—Microinstructions are of equal length
Solution: pipelining the microinstructions

Pentium Pro Implementation
(Micro-program flow: a fetch cycle routine, an indirect cycle routine and an interrupt cycle routine; the execute cycle begins with a jump to the op-code’s routine, e.g. the AND routine or the ADD routine.)
Note: each micro-program ends with a branch to the Fetch, Interrupt, Indirect or Execute micro-program.

Pentium Pro Implementation

All RISC features are implemented on the execution of microinstructions instead of machine instructions
—Microinstruction-level pipeline with dynamically scheduled microoperations
 –Fetch machine instruction (3 stages)
 –Decode machine instruction into microinstructions (2 stages)
 –Issue microinstructions (2 stages; register renaming and reorder-buffer allocation are performed here)
 –Execute microinstructions (1 stage; floating-point units pipelined; execution takes between 1 and 32 cycles)
 –Write back (3 stages)
 –Commit (3 stages)
—Superscalar: can issue up to 3 microoperations per clock cycle
—Reservation stations (20 of them) and multiple functional units (5 of them)
—Reorder buffer (40 entries) and speculation used

Pentium Pro Implementation
Execution unit latencies (in stages):
—Integer ALU: 1
—Integer load: 3
—Integer multiply: 4
—FP add: 3
—FP multiply: 5 (partially pipelined: multiplies can start every other cycle)
—FP divide: 32 (not pipelined)

Thread-Level Parallelism
ILP: parallelism at the instruction level. Thread-level parallelism: at a higher level
—Server applications
—Database queries
A thread has all the information (instructions, data, PC, register state, etc.) needed to execute
—on a separate processor, or
—as a process on a single processor

Thread-Level Parallelism
Potentially high efficiency. Desktop applications:
—Costly to switch to applications reprogrammed at the thread level
—Thread-level parallelism is often hard to find
=> ILP continues to be the focus for desktop-oriented processors (for embedded processors, the situation is different)