CPSC614 Lec 5.1 Instruction Level Parallelism and Dynamic Execution #4: Based on lectures by Prof. David A. Patterson E. J. Kim.

Slides:



Advertisements
Similar presentations
Speculative ExecutionCS510 Computer ArchitecturesLecture Lecture 11 Trace Scheduling, Conditional Execution, Speculation, Limits of ILP.
Advertisements

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Advanced Computer Architecture COE 501.
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Oct 19, 2005 Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)
Dynamic Branch Prediction (Sec 4.3) Control dependences become a limiting factor in exploiting ILP So far, we’ve discussed only static branch prediction.
CPE 631: ILP, Static Exploitation Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar Milenkovic,
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:
Dynamic Branch PredictionCS510 Computer ArchitecturesLecture Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.
CS152 Lec15.1 Advanced Topics in Pipelining Loop Unrolling Super scalar and VLIW Dynamic scheduling.
Pipelining 5. Two Approaches for Multiple Issue Superscalar –Issue a variable number of instructions per clock –Instructions are scheduled either statically.
CPE 631: Branch Prediction Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar Milenkovic,
Dynamic Branch Prediction
Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)
Lecture 8: More ILP stuff Professor Alvin R. Lebeck Computer Science 220 Fall 2001.
EEL 5708 Speculation. Branch prediction. Superscalar processors. Lotzi Bölöni.
CS 211: Computer Architecture Lecture 5 Instruction Level Parallelism and Its Dynamic Exploitation Instructor: M. Lancaster Corresponding to Hennessey.
Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.
CPE 731 Advanced Computer Architecture ILP: Part II – Branch Prediction Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
W04S1 COMP s1 Seminar 4: Branch Prediction Slides due to David A. Patterson, 2001.
1 Lecture 7: Static ILP, Branch prediction Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections )
CPSC614 Lec 5.1 Instruction Level Parallelism and Dynamic Execution #4: Based on lectures by Prof. David A. Patterson E. J. Kim.
1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)
EECC551 - Shaaban #1 lec # 5 Fall Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with.
EENG449b/Savvides Lec /17/04 February 17, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG.
1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.
COMP381 by M. Hamdi 1 Superscalar Processors. COMP381 by M. Hamdi 2 Recall from Pipelining Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data.
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Oct. 9, 2002 Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)
Goal: Reduce the Penalty of Control Hazards
Trace Caches J. Nelson Amaral. Difficulties to Instruction Fetching Where to fetch the next instruction from? – Use branch prediction Sometimes there.
Review of CS 203A Laxmi Narayan Bhuyan Lecture2.
EECC551 - Shaaban #1 lec # 8 Winter Multiple Instruction Issue: CPI < 1 To improve a pipeline’s CPI to be better [less] than one, and to.
CIS 629 Fall 2002 Multiple Issue/Speculation Multiple Instruction Issue: CPI < 1 To improve a pipeline’s CPI to be better [less] than one, and to utilize.
1 Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections , )
Computer Architecture Instruction Level Parallelism Dr. Esam Al-Qaralleh.
EENG449b/Savvides Lec /25/05 March 24, 2005 Prof. Andreas Savvides Spring g449b EENG 449bG/CPSC 439bG.
CIS 429/529 Winter 2007 Branch Prediction.1 Branch Prediction, Multiple Issue.
1 Lecture 7: Branch prediction Topics: bimodal, global, local branch prediction (Sections )
ENGS 116 Lecture 91 Dynamic Branch Prediction and Speculation Vincent H. Berk October 10, 2005 Reading for today: Chapter 3.2 – 3.6 Reading for Wednesday:
1 Lecture 7: Static ILP and branch prediction Topics: static speculation and branch prediction (Appendix G, Section 2.3)
Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.
Spring 2003CSE P5481 VLIW Processors VLIW (“very long instruction word”) processors instructions are scheduled by the compiler a fixed number of operations.
1 Advanced Computer Architecture Dynamic Instruction Level Parallelism Lecture 2.
1 Dynamic Branch Prediction. 2 Why do we want to predict branches? MIPS based pipeline – 1 instruction issued per cycle, branch hazard of 1 cycle. –Delayed.
CSCI 6461: Computer Architecture Branch Prediction Instructor: M. Lancaster Corresponding to Hennessey and Patterson Fifth Edition Section 3.3 and Part.
CSCE 614 Fall Hardware-Based Speculation As more instruction-level parallelism is exploited, maintaining control dependences becomes an increasing.
Branch.1 10/14 Branch Prediction Static, Dynamic Branch prediction techniques.
1 Lecture 7: Speculative Execution and Recovery Branch prediction and speculative execution, precise interrupt, reorder buffer.
CPE 631 Session 17 Branch Prediction Electrical and Computer Engineering University of Alabama in Huntsville.
Csci 136 Computer Architecture II – Superscalar and Dynamic Pipelining Xiuzhen Cheng
CS203 – Advanced Computer Architecture ILP and Speculation.
Use of Pipelining to Achieve CPI < 1
Dynamic Branch Prediction
Instruction-Level Parallelism and Its Dynamic Exploitation
CS 352H: Computer Systems Architecture
CS203 – Advanced Computer Architecture
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue
April 2, 2002 Prof. David E. Culler Computer Science 252 Spring 2002
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
Lecture 6: Static ILP, Branch prediction
Dynamic Hardware Branch Prediction
CPE 631: Branch Prediction
Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)
Dynamic Branch Prediction
Control unit extension for data hazards
Lecture 10: Branch Prediction and Instruction Delivery
CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue
Adapted from the slides of Prof
CSC3050 – Computer Architecture
Dynamic Hardware Prediction
CPE 631 Lecture 12: Branch Prediction
Presentation transcript:

CPSC614 Lec 5.1 Instruction Level Parallelism and Dynamic Execution #4: Based on lectures by Prof. David A. Patterson E. J. Kim

CPSC614 Lec 5.2 Correlating Predictors Two-level predictors if (d == 0) d = 1; if (d == 1)

CPSC614 Lec 5.3 initial value of d b1value of d before b2 b

CPSC614 Lec bit Predictor (Initialized to NT) db1 predic b1 action new b1 pr b2 predic b2 action new b2 pr

CPSC614 Lec 5.5 (1,1) Predictor Every branch has two separate prediction bits. –First bit: the prediction if the last branch in the program is not taken. –Second bit: the prediction if the last branch in the program is taken. Write the pair of prediction bits together.

CPSC614 Lec 5.6 Combinations & Meaning Prediction bitsPrediction if not taken Prediction if taken

CPSC614 Lec 5.7 (m,n) Predictor Uses the last m branches to choose from 2 m branch predictors, each of which is an n-bit predictor. Yields higher prediction rates than 2-bit scheme Requires a trivial amount of additional hardware The global history of the most recent m branches are recorded in an m-bit shift register.

CPSC614 Lec 5.8

CPSC614 Lec 5.9 (m,n) Predictor Total number of bits: = 2 m x n x #prediction entries selected by the branch address Examples

CPSC614 Lec 5.10

CPSC614 Lec 5.11 Tournament Predictors Most popular multilevel branch predictors

CPSC614 Lec 5.12 Tournament Predictors By using multiple predictors (one based on global information, one based on local information, and combining them with a selector), it can select the right predictor for the right branch. Alpha –Uses most sophisticated branch predictor as of 2001.

CPSC614 Lec 5.13

CPSC614 Lec 5.14

CPSC614 Lec Branch Prediction Schemes 1.1-bit Branch-Prediction Buffer 2.2-bit Branch-Prediction Buffer 3.Correlating Branch Prediction Buffer 4.Tournament Branch Predictor 5.Branch Target Buffer 6.Integrated Instruction Fetch Units 7.Return Address Predictors

CPSC614 Lec 5.16 Need Address at Same Time as Prediction Branch Target Buffer (BTB): Address of branch index to get prediction AND branch address (if taken) Branch PCPredicted PC =? PC of instruction FETCH Extra prediction state bits Yes: instruction is branch and use predicted PC as next PC No: branch not predicted, proceed normally (Next PC = PC+4)

CPSC614 Lec 5.17 Avoid branch prediction by turning branches into conditionally executed instructions: if (x) then A = B op C else NOP –If false, then neither store result nor cause exception –Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr. –IA-64: 64 1-bit condition fields selected so conditional execution of any instruction –This transformation is called “if-conversion” Drawbacks to conditional instructions –Still takes a clock even if “annulled” –Stall if condition evaluated late –Complex conditions reduce effectiveness; condition becomes known late in pipeline x A = B op C Predicated Execution

CPSC614 Lec 5.18 Special Case Return Addresses Register Indirect branch hard to predict address SPEC89 85% such branches for procedure return Since stack discipline for procedures, save return address in small buffer that acts like a stack: 8 to 16 entries has small miss rate

CPSC614 Lec 5.19 Dynamic Branch Prediction Summary Prediction becoming important part of scalar execution Branch History Table: 2 bits for loop accuracy Correlation: Recently executed branches correlated with next branch. –Either different branches –Or different executions of same branches Tournament Predictor: more resources to competitive solutions and pick between them Branch Target Buffer: include branch address & prediction Predicated Execution can reduce number of branches, number of mispredicted branches Return address stack for prediction of indirect jump

CPSC614 Lec 5.20 Getting CPI < 1: Issuing Multiple Instructions/Cycle Vector Processing: Explicit coding of independent loops as operations on large vectors of numbers –Multimedia instructions being added to many processors Superscalar: varying no. instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo) –IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4 (Very) Long Instruction Words (V)LIW: fixed number of instructions (4-16) scheduled by the compiler; put ops into wide templates (TBD) –Intel Architecture-64 (IA-64) 64-bit address »Renamed: “Explicitly Parallel Instruction Computer (EPIC)” –Will discuss in 2 lectures Anticipated success of multiple instructions lead to Instructions Per Clock cycle (IPC) vs. CPI

CPSC614 Lec 5.21 Superscalar Processors Issue varying numbers of instructions per clock –statically scheduled »using compiler techniques »in-order execution –dynamically scheduled »Tomasulo ’ s algorithm »out-of-order execution

CPSC614 Lec 5.22 Superscalar MIPS: 2 instructions, 1 FP & 1 anything – Fetch 64-bits/clock cycle; Int on left, FP on right – Can only issue 2nd instruction if 1st instruction issues – More ports for FP registers to do FP load & FP op in a pair TypePipeStages Int. instructionIFIDEXMEMWB FP instructionIFIDEXMEMWB Int. instructionIFIDEXMEMWB FP instructionIFIDEXMEMWB Int. instructionIFIDEXMEMWB FP instructionIFIDEXMEMWB Figure 3.24 P.219

CPSC614 Lec 5.23 Multiple Issue Issues issue packet: group of instructions from fetch unit that could potentially issue in 1 clock –If instruction causes structural hazard or a data hazard either due to earlier instruction in execution or to earlier instruction in issue packet, then instruction does not issue –0 to N instruction issues per clock cycle, for N-issue Performing issue checks in 1 cycle could limit clock cycle time: O(n 2 -n) comparisons –=> issue stage usually split and pipelined –1st stage decides how many instructions from within this packet can issue, 2nd stage examines hazards among selected instructions and those already been issued –=> higher branch penalties => prediction accuracy important

CPSC614 Lec 5.24 Multiple Issue Challenges While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with: –Exactly 50% FP operations AND No hazards If more instructions issue at same time, greater difficulty of decode and issue: –Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue; (N-issue ~O(N 2 -N) comparisons) –Register file: need 2x reads and writes/cycle –Rename logic: must be able to rename same register multiple times in one cycle! For instance, consider 4-way issue: add r1, r2, r3add p11, p4, p7 sub r4, r1, r2  sub p22, p11, p4 lw r1, 4(r4)lw p23, 4(p22) add r5, r1, r2add p12, p23, p4 Imagine doing this transformation in a single cycle! –Result buses: Need to complete multiple instructions/cycle »So, need multiple buses with associated matching logic at every reservation station. »Or, need multiple forwarding paths

CPSC614 Lec 5.25 Dynamic Scheduling in Superscalar The easy way How to issue two instructions and keep in-order instruction issue for Tomasulo? –Assume 1 integer + 1 floating point –1 Tomasulo control for integer, 1 for floating point Issue 2X Clock Rate, so that issue remains in order Only loads/stores might cause dependency between integer and FP issue: –Replace load reservation station with a load queue; operands must be read in the order they are fetched –Load checks addresses in Store Queue to avoid RAW violation –Store checks addresses in Load Queue to avoid WAR,WAW

CPSC614 Lec 5.26 VLIW Processors issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction (EPIC: Explicitly Parallel Instruction Computers). Statically scheduled by the compiler.

CPSC614 Lec 5.27 nameIssue structure Hazard detection SchedulingDistinguishing characteristic Examples Superscalar (static) dynamich/wstaticin-order execution Sun Ultra SPARC II/III Superscalar (dynamic) dynamich/wdynamicsome out-of- order execution IBM Power2 Superscalar (speculative) dynamich/wdynamic w/ speculation out-of-order execution w/ speculation Pentium III/4, MIPS R10K, Alpha 21264, VLIW/LIWstatics/wstaticno hazards between issue packets Trimedia, i860 EPICmostly static mostly s/wmostly staticexplicit dependences marked by compiler Itanium

CPSC614 Lec 5.28 Hardware-Based Speculation As more instruction-level parallelism is exploited, maintaining control dependences becomes an increasing burden. => Speculating on the outcome of branches and executing the program as if the guesses were correct. Hardware Speculation

CPSC614 Lec Key Ideas of Hardware Speculation Dynamic Branch Prediction –Choose which instruction to execute. Speculation –Allow the execution of instructions before the control dependences are resolved (with the ability to undo the effect of an incorrectly speculated sequence). Dynamic Scheduling –Deal with the scheduling of different combinations of basic blocks

CPSC614 Lec 5.30 Examples PowerPC 603/604/G3/G4 MIPS R10000/12000 Intel Pentium II/III/4 Alpha AMD K5/K6/Athlon