1
Embedded Computer Architectures
Hennessy & Patterson Chapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation
Gerard Smit (Zilverling 4102), smit@cs.utwente.nl
André Kokkeler (Zilverling 4096), kokkeler@utwente.nl
2
Contents
- Introduction
- Hazards <= dependencies
- Instruction Level Parallelism; Tomasulo’s approach
- Branch prediction
3
Dependencies
- True data dependency
- Name dependency
—Antidependency
—Output dependency
- Control dependency
4
Data Dependency
[Diagram: Inst i produces a result consumed by Inst i+1, which in turn feeds Inst i+2 — a chain of data dependences]
Two instructions are data dependent => risk of a RAW hazard
5
Name Dependency
- Antidependence: Inst i reads a register or memory location that Inst j later writes. Two instructions are antidependent => risk of a WAR hazard.
- Output dependence: Inst i and Inst j both write the same register or memory location. Two instructions are output dependent => risk of a WAW hazard.
6
Control Dependency
The branch condition determines whether instruction i is executed => i is control dependent on the branch
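The three dependence classes above map directly onto hazard types. A minimal sketch in Python (the helper name and register sets are illustrative, not from the slides): given the registers each instruction reads and writes, the overlap determines which hazards are possible.

```python
def classify_dependences(writes_i, reads_i, writes_j, reads_j):
    """Classify the dependences of a later instruction j on an earlier
    instruction i, given each instruction's sets of registers read
    and written (an illustrative model)."""
    deps = set()
    if writes_i & reads_j:          # j reads what i wrote
        deps.add("true (RAW)")
    if reads_i & writes_j:          # j overwrites what i still reads
        deps.add("anti (WAR)")
    if writes_i & writes_j:         # both write the same location
        deps.add("output (WAW)")
    return deps

# i: F0 = F2 + F4   j: F6 = F0 * F8  -> j is truly dependent on i
print(classify_dependences({"F0"}, {"F2", "F4"}, {"F6"}, {"F0", "F8"}))
# → {'true (RAW)'}
```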
7
Instruction Level Parallelism
Pipelining exploits ILP. Another approach: dynamic scheduling => out-of-order execution. The Instruction Decode stage is split into
—Issue (decode, check for structural hazards)
—Read Operands
8
Instruction Level Parallelism
- Scoreboard: checks for
—sufficient resources
—no data dependencies
- Tomasulo’s approach
—Minimize RAW hazards
—Register renaming to minimize WAW and WAR hazards
A reservation station sits between issue and read-operands, parking instructions while they wait for operands.
9
Tomasulo’s approach: Register Renaming
Instruction sequence on register F0: 1. Read F0, 2. Write F0, 3. Read F0, 4. Write F0
[Diagram: register F0 on a time axis; for each instruction, an arrow from the start of the instruction to its use of the register]
10
Tomasulo’s approach: Register Renaming
Sequence: 1. Read F0, 2. Write F0, 3. Read F0, 4. Write F0
[Diagram: the same time axis; problems arise if the arrows cross]
11
Tomasulo’s approach: Register Renaming
Sequence: 1. Read F0, 2. Write F0, 3. Read F0, 4. Write F0
Instr 2, 3, … will be stalled. Note that Instr 2 and 3 are stalled only because Instr 1 is not ready; if not for Instr 1, they could be executed earlier.
12
Tomasulo’s approach: Register Renaming
Sequence: 1. Read F0, 2. Write F0, 3. Read F0, 4. Write F0
[Diagram: each reading instruction holds its own copy of the register: Instr 1.Register F0 and Instr 3.Register F0]
How is it arranged that the value is written into Instr 3.Register F0 and not into Instr 1.Register F0?
13
Tomasulo’s approach: Register Renaming
Sequence: 1. Read F0, 2. Write F0, 3. Read F0, 4. Write F0
[Diagram: Instr 1 and Instr 3 each hold a Register F0 copy plus an F0Source tag; Instr 1.F0Source = ‘Instr. k’, Instr 3.F0Source = ‘Instr. 2’]
The result of Instr 2 is labelled with ‘Instr. 2’. The hardware checks whether an instruction is waiting for the result (by checking the F0Source fields of instructions) and places the result in the correct place.
14
Tomasulo’s approach: Register Renaming
Sequence: 1. Read F0, 2. Write F0, 3. Read F0, 4. Write F0
[Diagram: Instr 3’s entry, with F0Source = ‘Instr. 2’, drawn as a reservation-station entry with fields F0Data, F0Source, operation (read)]
15
Tomasulo’s approach: Register Renaming
Sequence: 1. Read F0, 2. Write F0, 3. Read F0, 4. Write F0
[Diagram: two reservation-station entries, each with fields F0Data, F0Source, operation (read)]
16
Tomasulo’s approach: Register Renaming
Sequence: 1. Read F0, 2. Write F0, 3. Read F0, 4. Write F0
[Diagram: the reservation station holds entries with fields F0Data, F0Source, operation (read/write). The operation and source fields are filled during issue; the data fields are filled during execution.]
17
Tomasulo’s approach: effects
—Register renaming prevents WAW and WAR hazards
—Execution starts only when the operands are available (the data fields are filled): prevents RAW hazards
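The renaming effect can be sketched in a few lines of Python (a toy model, not the actual hardware): every instruction that writes a register gets a fresh physical register, and readers are rewired to the latest mapping, so only the true (RAW) dependences remain.

```python
def rename(instructions):
    """Rewrite (dest, src1, ...) tuples so that every write gets a
    fresh physical register; reads use the latest mapping. This
    eliminates WAR and WAW hazards, leaving only RAW dependences.
    Illustrative sketch; register names are made up."""
    mapping = {}        # architectural register -> current physical register
    counter = 0
    renamed = []
    for dest, *srcs in instructions:
        srcs = [mapping.get(s, s) for s in srcs]   # read the latest version
        counter += 1
        mapping[dest] = f"p{counter}"              # fresh destination register
        renamed.append((mapping[dest], *srcs))
    return renamed

# The slide's pattern: read F0, write F0, read F0, write F0
prog = [("F2", "F0"), ("F0", "F4"), ("F6", "F0"), ("F0", "F8")]
print(rename(prog))
# → [('p1', 'F0'), ('p2', 'F4'), ('p3', 'p2'), ('p4', 'F8')]
```

The two writes to F0 now target distinct registers p2 and p4 (no WAW), and the second reader names p2 explicitly, so the later write cannot disturb it (no WAR).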
18
Tomasulo’s approach: issue in more detail (issue is done sequentially)
Sequence: 1. Read F0, 2. Write F0, 3. Read F0, 4. Write F0
Reservation-station entry format: data, label, operation, source.
[Diagram: after issuing the first instructions the station holds entries read1 (source still empty) and write1; entries read2 and write2 follow]
This is the only information you have: during issue, you have to keep track of which instruction changed F0 last!
19
Tomasulo’s approach: issue in more detail
Sequence: 1. Read F0, 2. Write F0, 3. Read F0, 4. Write F0
Reservation-station entry format: data, label, operation, source.
[Diagram: entries read1, write1, read2, write2; the register-status entry for F0 is updated from ‘????’ to write1 and then to write2 as the writes issue]
Keeping track of the register status during issue is done for every register.
20
Tomasulo’s approach: definitions for the MIPS
For each reservation station:
—Name: label of the station
—Busy: in execution or not
—Op: operation to perform
—Vj, Vk: operand values
—Qj, Qk: operand sources (the stations that will produce the operands)
—A: memory address (Load, Store)
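As a sketch, the fields above can be modelled as a small record (field names follow the slide’s legend; the `ready` check is an illustrative addition, not part of the slide):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    """Fields of a Tomasulo reservation station for the MIPS FP unit,
    following the slide's legend (illustrative model)."""
    name: str                        # label identifying this station
    busy: bool = False               # currently holding an instruction?
    op: Optional[str] = None         # operation to perform
    vj: Optional[float] = None       # value of the first operand, if ready
    vk: Optional[float] = None       # value of the second operand, if ready
    qj: Optional[str] = None         # station that will produce Vj
    qk: Optional[str] = None         # station that will produce Vk
    a: Optional[int] = None          # memory address (loads/stores)

    def ready(self):
        # Execution may start once neither operand is still pending.
        return self.busy and self.qj is None and self.qk is None

rs = ReservationStation(name="Add1", busy=True, op="ADD.D", vj=1.0, qk="Mult1")
print(rs.ready())   # False: still waiting for Mult1 to broadcast Vk
```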
21
Tomasulo’s approach; hardware view
[Diagram: from the instruction queue, the issue hardware (register renaming) fills the reservation stations. The ‘execution control hardware’ determines for which instructions the operands and the corresponding execution units are available and transports the operands to the execution units. Results, together with the identification of the instruction producing them, return over the Common Data Bus to the ‘reservation fill hardware’, which puts the data in the correct place in the reservation stations.]
22
Branch prediction
- Data hazards => Tomasulo’s approach
- Branch (control) hazards => branch prediction
—Goal: resolve the outcome of the branch early => prevent stalls because of control hazards
23
Branch prediction; 1 history bit
Example:
Outerloop: …
           R=10
Innerloop: …
           R=R-1
           BNZ R, Innerloop
           …
           Branch Outerloop
History bit: was the branch taken previously or not?
- predict taken: fetch from ‘Innerloop’
- predict not taken: fetch the next instruction
Actual outcome of the branch:
- taken: set the history bit to ‘taken’
- not taken: set the history bit to ‘not taken’
In this situation: correct prediction in 80% of branch evaluations
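The 80% figure can be checked with a short simulation (a sketch; the trace assumes R=10, so per outer iteration the inner branch is taken 9 times and falls through once):

```python
def one_bit_accuracy(outcomes):
    """Single history bit: predict what the branch did last time,
    then update with the actual outcome (illustrative sketch)."""
    history = False                  # start out predicting 'not taken'
    correct = 0
    for taken in outcomes:
        correct += (history == taken)
        history = taken              # remember the actual outcome
    return correct / len(outcomes)

# Inner loop: BNZ taken 9 times, then falls through once; repeated.
trace = ([True] * 9 + [False]) * 100
print(one_bit_accuracy(trace))       # 0.8: both loop edges mispredict
```

Each pass through the inner loop mispredicts twice, on entry (the last recorded outcome was the fall-through) and on exit, hence 8 correct predictions out of 10.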
24
Branch prediction; 2 history bits
Example (same loop):
Outerloop: …
           R=10
Innerloop: …
           R=R-1
           BNZ R, Innerloop
           …
           Branch Outerloop
2 history bits: a state machine with ‘predict taken’ and ‘predict not taken’ states, moved along by taken / not-taken outcomes; a single misprediction no longer flips the prediction.
In this application: correct prediction in 90% of branch evaluations
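The same trace through a 2-bit saturating counter reproduces the 90% figure (a sketch; starting in the ‘weakly taken’ state is an assumption, since the slide does not fix the initial state):

```python
def two_bit_accuracy(outcomes, counter=2):
    """2-bit saturating counter: predict taken when counter >= 2.
    Illustrative sketch; the initial state 'weakly taken' (2) is an
    assumption."""
    correct = 0
    for taken in outcomes:
        correct += ((counter >= 2) == taken)
        # Saturate at 0 and 3 instead of wrapping around.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)

trace = ([True] * 9 + [False]) * 100
print(two_bit_accuracy(trace))       # 0.9: only the loop exit mispredicts
```

The single not-taken exit only drops the counter from ‘strongly taken’ to ‘weakly taken’, so the prediction at the next loop entry is still correct: one misprediction per 10 evaluations instead of two.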
25
Branch prediction; Correlating branch predictors
if (aa == 2) aa = 0;
if (bb == 2) bb = 0;
if (aa != bb) …
The results of the first two branches are used in the prediction of the last branch.
Example: suppose aa == 2 and bb == 2; then the condition of the last ‘if’ is always false => if the previous two branches are not taken, the last branch is taken.
26
Branch prediction; Correlating branch predictors
Mechanism: suppose the results of the 3 previous branches are used to influence the decision. 8 possible sequences:
br-3 br-2 br-1 | br (branch under consideration)
 NT   NT   NT  |  T
 NT   NT   T   |  NT
 …    …    …   |  …
 T    T    T   |  T
Depending on the outcome of the branch under consideration, the prediction is changed:
—1-bit history: (3,1) predictor
For the sequence (NT NT NT) the prediction is that the branch will be taken => fetch from the branch destination.
27
Branch prediction; Correlating branch predictors
Mechanism: suppose the results of the 3 previous branches are used to influence the decision. 8 possible sequences:
br-3 br-2 br-1 | br (branch under consideration)
 NT   NT   NT  |  T
 NT   NT   T   |  NT
 …    …    …   |  …
 T    T    T   |  T
Depending on the outcome of the branch under consideration, the prediction is changed:
—1-bit history: (3,1) predictor
—2-bit history: (3,2) predictor; each prediction is represented by 2 bits (2 combinations indicate ‘predict taken’, 2 indicate ‘predict not taken’) and updated by means of a state machine
For the sequence (NT NT NT) the prediction is that the branch will be taken => fetch from the branch destination.
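A (3,1) predictor can be sketched as 8 one-bit predictors selected by a 3-bit global history register (an illustrative model; the branch trace below is constructed so the outcome depends only on the three previous outcomes, in the spirit of the slide’s aa/bb example):

```python
def simulate_3_1(outcomes):
    """(3,1) correlating predictor: the outcomes of the last 3 branches
    select one of 8 one-bit predictors (illustrative sketch)."""
    table = [0] * 8      # one prediction bit per 3-outcome pattern
    history = 0          # global shift register of the last 3 outcomes
    correct = 0
    for taken in outcomes:
        correct += (table[history] == taken)
        table[history] = taken                    # 1-bit update
        history = ((history << 1) | taken) & 0b111
    return correct / len(outcomes)

# A branch that is taken exactly when the previous three were not taken:
outs, recent = [], [0, 0, 0]
for _ in range(300):
    t = 1 if recent == [0, 0, 0] else 0
    outs.append(t)
    recent = recent[1:] + [t]
print(simulate_3_1(outs))   # 299/300 correct: one miss while warming up
```

Once each history pattern has been seen once, the table predicts this branch perfectly, which a predictor without history bits cannot do for a pattern like T NT NT NT.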
28
Branch Target Buffer
Even with a good prediction, we do not know where to branch to until late in the pipeline, and we have already fetched the next instruction.
Solutions:
—Delayed branch
—Branch target buffer
29
Branch Target Buffer
[Diagram: the Program Counter indexes both the instruction cache and the branch target buffer, which stores the addresses of branch instructions together with the corresponding branch targets. On a hit, the stored target is selected for the next fetch; otherwise the next address comes from the instruction decode hardware. After the IF stage the branch address is already in the PC.]
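The lookup can be sketched as a map from branch-instruction addresses to targets (the addresses and the `predict_taken` hook are illustrative, not from the slides):

```python
def next_fetch_pc(pc, btb, predict_taken):
    """Branch-target-buffer lookup during instruction fetch (sketch):
    if the current PC hits in the BTB and the branch is predicted
    taken, fetch from the stored target with no bubble; otherwise
    fall through to the next sequential instruction."""
    if pc in btb and predict_taken(pc):
        return btb[pc]       # target known without waiting for decode
    return pc + 4            # sequential fetch (4-byte instructions)

btb = {0x40C: 0x400}         # hypothetical BNZ at 0x40C jumping to 0x400
print(hex(next_fetch_pc(0x40C, btb, lambda pc: True)))    # 0x400
print(hex(next_fetch_pc(0x410, btb, lambda pc: True)))    # 0x414
```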
30
Branch Folding
[Diagram: as in the branch target buffer, the Program Counter is matched against the addresses of branch instructions, but the buffer stores the instruction at the branch target instead of the target address; on a hit that instruction is selected directly.]
Unconditional branches: this effectively removes the branch instruction (penalty of -1).
31
Return Address Predictors
Indirect branches: the branch address is only known at run time. 80% of the time these are return instructions. Solution: a small, fast stack of return addresses, pushed on a procedure call and popped on RET.
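A sketch of such a stack (the depth and addresses are illustrative): calls push the return address, RET pops the prediction, and on overflow the oldest entry is dropped.

```python
class ReturnAddressStack:
    """Small fixed-depth stack predicting return addresses
    (illustrative sketch of the slide's 'small fast stack')."""
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def call(self, return_addr):
        if len(self.stack) == self.depth:   # overflow: drop the oldest
            self.stack.pop(0)
        self.stack.append(return_addr)

    def ret(self):
        # Predicted return address; None when the stack has underflowed.
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.call(0x100)           # outer procedure call
ras.call(0x200)           # nested call
print(hex(ras.ret()))     # 0x200: the inner return is predicted first
print(hex(ras.ret()))     # 0x100
```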
32
Multiple Issue Processors
Goal: issue multiple instructions in a clock cycle
- Superscalar: issues a varying number of instructions per clock
—Statically scheduled
—Dynamically scheduled
- VLIW: issues a fixed number of instructions per clock
—Statically scheduled
33
Multiple Issue Processors: example
Instruction type | Pipe stages
Integer          | IF ID EX MEM WB
FP               | IF ID EX     WB
Integer          |    IF ID EX MEM WB
FP               |    IF ID EX     WB
Integer          |       IF ID EX MEM WB
FP               |       IF ID EX     WB
Integer          |          IF ID EX MEM WB
FP               |          IF ID EX     WB
(one integer and one FP instruction issued together each clock cycle)
34
Hardware Based Speculation
Multiple issue processors => nearly 1 branch every clock cycle
- Dynamic scheduling + branch prediction: speculative fetch + issue
- Dynamic scheduling + branch speculation: speculative fetch + issue + execution
KEY: do not perform updates that cannot be undone until you are sure the corresponding operation really should be executed.
35
Hardware Based Speculation
Tomasulo: Branch (Predict Not Taken)
[Diagram: operations before the branch have finished and updated the register file. Operation k has been issued and its operand is available, but execution is postponed until it is clear whether the branch is taken.]
36
Hardware Based Speculation
Tomasulo: Branch (Predict Not Taken)
[Diagram: operation k has been issued; the operations before the branch have finished. Depending on the outcome of the branch: flush the reservation stations, or start execution.]
37
Hardware Based Speculation
Speculation: Branch (Predict Not Taken)
[Diagram: operation k is issued and executed as soon as its operand is available, even before the branch resolves. Results of operations before the branch are committed sequentially, from the reorder buffer to the register file.]
38
Hardware Based Speculation
Speculation: Branch (Predict Not Taken)
[Diagram: as the branch resolves, commit advances: operations up to the branch are marked committed, while the speculatively executed operation k still waits in the reorder buffer.]
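Sequential commit from the reorder buffer can be sketched as follows (a toy model where entries are just (name, finished) pairs): execution may complete out of order, but commit always stops at the first unfinished entry.

```python
def commit_ready(rob):
    """In-order commit from a reorder buffer (illustrative sketch):
    commit finished entries from the head and stop at the first
    unfinished one, even if later entries have already completed."""
    committed = []
    while rob and rob[0][1]:
        committed.append(rob.pop(0)[0])
    return committed

# Operation k finished out of order, but operation i has not:
rob = [("op_i", False), ("op_j", True), ("op_k", True)]
print(commit_ready(rob))    # []: nothing commits past unfinished op_i
rob[0] = ("op_i", True)
print(commit_ready(rob))    # ['op_i', 'op_j', 'op_k']
```

Because updates leave the reorder buffer strictly in program order, a mispredicted branch can simply discard the entries behind it: nothing irreversible has happened yet.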
39
Hardware Based Speculation Speculation: Branch (Predict Not Taken) Register File Operation i Operation k Operation k: -Operand available and executed Reorder Buffer Commit: sequentially Committed
41
Hardware Based Speculation: some aspects
—Instructions that cause a lot of work should not be executed in vain => restrict the allowed actions in speculative mode
—The ILP of a program is limited
—Realistic branch predictors: easier to implement => less efficient
42
Pentium Pro Implementation
Pentium family:
Processor        | Year | Clock rate (MHz) | L1 cache (instr, data) | L2 cache
Pentium Pro      | 1995 | 100-200          | 8 KB, 8 KB             | 256 KB, 1024 KB
Pentium II       | 1998 | 233-450          | 16 KB, 16 KB           | 256 KB, 512 KB
Pentium II Xeon  | 1999 | 400-450          | 16 KB, 16 KB           | 512 KB, 2 MB
Celeron          | 1999 | 500-900          | 16 KB, 16 KB           | 128 KB
Pentium III      | 1999 | 450-1100         | 16 KB, 16 KB           | 256 KB, 512 KB
Pentium III Xeon | 2000 | 700-900          | 16 KB, 16 KB           | 1 MB, 2 MB
43
Pentium Pro Implementation
i486: CISC => problems with pipelining. Two observations:
—CISC instructions are translated into sequences of microinstructions
—Each microinstruction is of equal length
Solution: pipeline the microinstructions.
44
Pentium Pro Implementation
[Diagram: microprogram control flow. The fetch cycle routine ends with a jump to the indirect or execute routine; the indirect cycle routine jumps to execute; the interrupt cycle routine jumps to fetch. At the start of the execute cycle a jump is made to the opcode’s routine (AND routine, ADD routine, …), each of which ends with a jump to fetch or interrupt.]
Note: each micro-program ends with a branch to the Fetch, Interrupt, Indirect or Execute micro-program.
45
Pentium Pro Implementation
46
All RISC features are implemented on the execution of microinstructions instead of machine instructions:
—Microinstruction-level pipeline with dynamically scheduled microoperations
–Fetch machine instruction (3 stages)
–Decode machine instruction into microinstructions (2 stages)
–Issue microinstructions (2 stages; register renaming and reorder-buffer allocation are performed here)
–Execute microinstructions (1 stage; floating-point units pipelined; execution takes between 1 and 32 cycles)
–Write back (3 stages)
–Commit (3 stages)
—Superscalar: can issue up to 3 microoperations per clock cycle
—Reservation stations (20 of them) and multiple functional units (5 of them)
—Reorder buffer (40 entries) and speculation used
47
Pentium Pro Implementation
Execution-unit latencies (cycles):
Integer ALU: 1
Integer load: 3
Integer multiply: 4
FP add: 3
FP multiply: 5 (partially pipelined: multiplies can start every other cycle)
FP divide: 32 (not pipelined)
48
Thread-Level Parallelism
- ILP: parallelism at the instruction level
- Thread-level parallelism: parallelism at a higher level
—Server applications
—Database queries
- A thread has all the information (instructions, data, PC, register state, etc.) it needs to execute
—on a separate processor
—as a process on a single processor
49
Thread-Level Parallelism
Potentially high efficiency. Desktop applications:
—Costly to switch to applications reprogrammed at the thread level
—Thread-level parallelism is often hard to find
=> ILP continues to be the focus for desktop-oriented processors (for embedded processors the situation is different)