Caltech CS184b Winter2001 -- DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day8:

Slides:

Advertisements

Similar presentations

Hardware-Based Speculation. Exploiting More ILP Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP One way.

Advertisements

Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 4: April 7, 2003 Instruction Level Parallelism ILP.

Mehmet Can Vuran, Instructor University of Nebraska-Lincoln Acknowledgement: Overheads adapted from those provided by the authors of the textbook.

Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.

Pipeline Computer Organization II 1 Hazards Situations that prevent starting the next instruction in the next cycle Structural hazards – A required resource.

Chapter 4 The Processor CprE 381 Computer Organization and Assembly Level Programming, Fall 2013 Zhao Zhang Iowa State University Revised from original.

Pipeline Hazards Pipeline hazards These are situations that inhibit that the next instruction can be processed in the next stage of the pipeline. This.

Instruction-Level Parallelism (ILP)

Chapter 8. Pipelining. Instruction Hazards Overview Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline.

Computer Architecture 2011 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)

1 Lecture 8: Branch Prediction, Dynamic ILP Topics: branch prediction, out-of-order processors (Sections )

Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day6:

Chapter 12 Pipelining Strategies Performance Hazards.

1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.

Computer Architecture 2011 – out-of-order execution (lec 7) 1 Computer Architecture Out-of-order execution By Dan Tsafrir, 11/4/2011 Presentation based.

Goal: Reduce the Penalty of Control Hazards

Trace Caches J. Nelson Amaral. Difficulties to Instruction Fetching Where to fetch the next instruction from? – Use branch prediction Sometimes there.

1 Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections )

Computer Architecture 2010 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

1/1/ / faculty of Electrical Engineering eindhoven university of technology Speeding it up Part 2: Pipeline problems & tricks dr.ir. A.C. Verschueren Eindhoven.

Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.

Lecture 15: Pipelining and Hazards CS 2011 Fall 2014, Dr. Rozier.

1 Appendix A Pipeline implementation Pipeline hazards, detection and forwarding Multiple-cycle operations MIPS R4000 CDA5155 Spring, 2007, Peir / University.

1 Advanced Computer Architecture Dynamic Instruction Level Parallelism Lecture 2.

Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day5:

1 Out-Of-Order Execution (part I) Alexander Titov 14 March 2015.

Branch.1 10/14 Branch Prediction Static, Dynamic Branch prediction techniques.

CS 1104 Help Session IV Five Issues in Pipelining Colin Tan, S

Chapter 6 Pipelined CPU Design. Spring 2005 ELEC 5200/6200 From Patterson/Hennessey Slides Pipelined operation – laundry analogy Text Fig. 6.1.

1 Lecture 7: Speculative Execution and Recovery Branch prediction and speculative execution, precise interrupt, reorder buffer.

Superscalar - summary Superscalar machines have multiple functional units (FUs) eg 2 x integer ALU, 1 x FPU, 1 x branch, 1 x load/store Requires complex.

Implementing Precise Interrupts in Pipelined Processors James E. Smith Andrew R.Pleszkun Presented By: Shrikant G.

11 Pipelining Kosarev Nikolay MIPT Oct, Pipelining Implementation technique whereby multiple instructions are overlapped in execution Each pipeline.

Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day7:

Adapted from Computer Organization and Design, Patterson & Hennessy, UCB ECE232: Hardware Organization and Design Part 13: Branch prediction (Chapter 4/6)

COMPSYS 304 Computer Architecture Speculation & Branching Morning visitors - Paradise Bay, Bay of Islands.

Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.

LECTURE 10 Pipelining: Advanced ILP. EXCEPTIONS An exception, or interrupt, is an event other than regular transfers of control (branches, jumps, calls,

1 Lecture: Out-of-order Processors Topics: branch predictor wrap-up, a basic out-of-order processor with issue queue, register renaming, and reorder buffer.

1 Lecture: Out-of-order Processors Topics: a basic out-of-order processor with issue queue, register renaming, and reorder buffer.

Out-of-order execution Lihu Rappoport 11/ MAMAS – Computer Architecture Out-Of-Order Execution Dr. Lihu Rappoport.

Lecture: Out-of-order Processors

CS 352H: Computer Systems Architecture

Computer Organization CS224

Pipeline Implementation (4.6)

Appendix C Pipeline implementation

Morgan Kaufmann Publishers The Processor

Pipelining: Advanced ILP

Morgan Kaufmann Publishers The Processor

Morgan Kaufmann Publishers The Processor

Lecture 6: Advanced Pipelines

Pipelining review.

The processor: Pipelining and Branching

Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2

Pipelining in more detail

Lecture: Out-of-order Processors

Lecture 8: Dynamic ILP Topics: out-of-order processors

15-740/ Computer Architecture Lecture 5: Precise Exceptions

How to improve (decrease) CPI

Advanced Computer Architecture

Control unit extension for data hazards

Control unit extension for data hazards

Dynamic Hardware Prediction

Control unit extension for data hazards

Lecture 9: Dynamic ILP Topics: out-of-order processors

Conceptual execution on a processor which exploits ILP

CS184b: Computer Architecture (Abstractions and Optimizations)

Presentation transcript:

Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day8: January 30, 2000 Exploiting Instruction-Level Parallelism

Caltech CS184b Winter DeHon 2 Today Reducing Control Costs –Branch Prediction –Branch Target Buffer –Conditional Operations –Speculation

Caltech CS184b Winter DeHon 3 Control Flow Previously saw data hazards on control force stalls –for multiple cycles With superscalar, may be issuing multiple instructions per cycle Makes stalls even more expensive –wasting n slots per cycle –e.g. with 7 instructions / branch issue 7 instructions, hit branch, stall for instructions to complete...

Caltech CS184b Winter DeHon 4 Control/Branches Cannot afford to stall on branches for resolution Limit parallelism to basic block –average run length between branches

Caltech CS184b Winter DeHon 5 Idea Predict branch direction Execute as if that’s correct Commit/discard results after know branch direction Use ideas seen for precise exceptions to separate –working values –architecture state

Caltech CS184b Winter DeHon 6 Goal Correctly predicted branches run as if they weren’t there (noops) Maximize the expected run length between mis-predicted branches

Caltech CS184b Winter DeHon 7 Expected Run Length E(l) = L1 + L2*P1+L3*P1*P2+L4*P1*P2*P3 Li=l, Pi=P E(l)=l(1+p+p 2 +p 3 +…) E(l)=l/(1-p) E(l) = 1/(probability of mispredict)

Caltech CS184b Winter DeHon 8 Expected Run Length P=0.910l p=0.9520l p= l p= l Halving mispredict rate, doubles run length

Caltech CS184b Winter DeHon 9 IPC Run for E(l) instructions Then mispredict –waste ~ pipeline delay cycles (and all work) Pipe delay: d Base IPC: n E(l)/n cycles issue n d cycles issue nothing useful IPC=E(l)/(E(l)/n+d)=n/(1+dn/E(l))

Caltech CS184b Winter DeHon 10 Branch Prediction Previous runs (dynamic) History Correlated

Caltech CS184b Winter DeHon 11 Previous Run Hypothesis: branch behavior is largely a characteristic of the program. –Can use data from previous runs (different input data set) –to predict branch behavior Fisher: Instructions/mispredict: –even with different data sets

Caltech CS184b Winter DeHon 12 Data Prediction Example shows value (and validity) of feedback –run program –measure behavior –use to inform compiler, generate better code Static/procedural analysis –often cannot yield enough information –most behavioral properties are undecidable

Caltech CS184b Winter DeHon 13 Branch History Table Hypothesis: we can predict the behavior of a branch based on it’s recent past behavior. –If a branch has been taken, we’ll predict it’s taken this time. To exploit dynamic strength, would like to be responsive to changing program behavior.

Caltech CS184b Winter DeHon 14 Branch History Table Implementation –Saturating counter count up branch taken; down on branch not taken –Predict direction based on majority (which side of mid-point) of past branches Saturation –keeps counter small (cheap) typically 2b –limits amount of history kept time to “learn” new behavior

Caltech CS184b Winter DeHon 15 Correlated Branch Prediction Hypothesis: branch directions are highly correlated –a branch is likely to depend on branches which occurred before it. Implementation –look at last m branches shift register on branch directions –use a separate counter to track each of the 2 m cases –contain cost: only keep a small number of entries and allow aliasing

Caltech CS184b Winter DeHon 16 Branch Prediction …whole host of schemes and variations proposed in literature

Caltech CS184b Winter DeHon 17 Prediction worked for Direction... Note: –have to decode instruction to figure out if it’s a branch –then have to perform arithmetic to get branch target So, don’t know where the branch is going until cycle (or several) after fetch –IF ID EX

Caltech CS184b Winter DeHon 18 Branch-Target Buffer Take it one step back and predict target address Cache –in parallel with Memory Fetch (IF) –stores predicted target PC and branch prediction –tagged with PC to avoid aliasing

Caltech CS184b Winter DeHon 19 Reducing Number of Branching A mispredicted branch costs more than a few cycles in these wide-issue machines –potentially n*d Especially in cases of reconvergent flow and even branch probabilities

Caltech CS184b Winter DeHon 20 Conditional Operations Idea: create guarded operations –only change register if some result holds e.g. –8b saturating add c=a+b if (t1>255) c=255 if (t1<0) c=0 ADD R1,R2,R3 SUB R4,R1,#255 CMOVP R1,#255,R4 COMVN R1,#0,R1

Caltech CS184b Winter DeHon 21 Conditional Operation Prospect For unpredictable branch (p~=0.5) –E(wasted issue slots) = p*n*d (n*d/2) With conditional move –assume l cycles inside conditional clauses –one sided if: E(wasted) = p*l (l/2) –two sided (both length l) E(wasted) = l Net benefit for short guarded blocks –on wide issue machines

Caltech CS184b Winter DeHon 22 Speculation Branch prediction allows us to continue executing still have to deal with branch being wrong In simple pipelined ISA –outstanding branch resolved before writeback

Caltech CS184b Winter DeHon 23 Speculation Wide-issue ISA? –Likely to have more instructions in flight than mean latency between branches (nd>l) –to exploit parallelism, need to continue computing assuming the chosen path is correct means making result values visible to subsequent instructions which may be wrong if control flow goes another way

Caltech CS184b Winter DeHon 24 Old Problem Mostly the same problem as precise exceptions –want to continue computing forward with tenative values –want to preserve old state so can roll-back to known state

Caltech CS184b Winter DeHon 25 Revisit Re-Order IFID Reorder Bypass EX ALU MPY LD/ST RF Complex (big) bypass logic.

Caltech CS184b Winter DeHon 26 Speculation and Re-Order Buffer Compute and bypass values from re-order buffer At end of re-order buffer –commit to RF (architetural state) in proper program order after branches resolved –if branch wrong, nullify it’s effect (results predicated upon) –flush re-order buffer (pipeline) direct control flow back to correct branch direction

Caltech CS184b Winter DeHon 27 Details As before, – exception delivery must be deferred until can commit instruction –memory operations require reorder/bypass/commit as well History/Future File work –…but transfer time may be more critical in this case

Caltech CS184b Winter DeHon 28 Register Update Unit (RUU) Simplescalar uses this FIFO unit for instruction management serves for both issue and in-order commit Decode: fills empty slots Issue: picks next set of runnable instructions Execution results return here Commit: completed instructions in order from head of FIFO

Caltech CS184b Winter DeHon 29 RUU IF Decode Queue EX ALU MPY LD/ST RF RUU

Caltech CS184b Winter DeHon 30 RUU Needs to hold all outstanding instructions –from: considering for issue –to: completion and final RF writeback Needs to be relatively large Complex?

Caltech CS184b Winter DeHon 31 Reading Thursday: available ILP, costs –finish HP4 (at least 4.7) –quantifying Tuesday: VLIW –Fisher/VLIW and retrospective

Caltech CS184b Winter DeHon 32 Big Ideas Interruptions in Control Flow limit our ability to exploit parallelism There is structure in programs –predictability in control flow Make the common case fast Predict/guess common case control flow –to generate larger blocks Nullify effects of erroneous instructions when guess wrong