

Chapter 3 Instruction Level Parallelism 2
Dr. Eng. Amr T. Abdel-Hamid
Elect 707, Spring 2014, Computer Applications
Textbook slides: Computer Architecture: A Quantitative Approach, 4th Edition, John L. Hennessy & David A. Patterson, with modifications.

Dr. Amr Talaat Elect 707
Control Hazard on Branches: Three-Stage Stall
  10: beq r1,r3,36
  14: and r2,r3,r5
  18: or  r6,r1,r7
  22: add r8,r1,r9
  36: xor r10,r1,r11
[Figure: pipeline diagram; each instruction passes through the stages Ifetch, Reg, ALU, DMem, Reg]

Solving Branch Problems
- Prediction
  - Static: software
  - Dynamic: hardware
- Speculation

Static Branch Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
- Execute successor instructions in sequence
- "Squash" instructions in pipeline if branch is actually taken
- Advantage of late pipeline state update
- 47% of MIPS branches are not taken on average
- PC+4 is already calculated, so use it to get the next instruction
#3: Predict Branch Taken
- 53% of MIPS branches are taken on average
- But the branch target address has not yet been calculated in MIPS
  - MIPS still incurs a 1-cycle branch penalty
  - Other solutions: branch target known before outcome

Dynamic Branch Prediction
- Performance = ƒ(accuracy, cost of misprediction)
- Branch History Table (BHT): the lower bits of the PC address index a table of 1-bit values
  - Says whether or not the branch was taken last time
  - No address check
- Problem: in a loop, a 1-bit BHT will cause two mispredictions (average is 9 iterations before exit):
  - End-of-loop case, when it exits instead of looping as before
  - First time through the loop on the next time through the code, when it predicts exit instead of looping
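The loop problem above can be checked with a small simulation. This is a minimal sketch, not part of the original slides: the function name `simulate_one_bit` and the initial prediction bit are assumptions made for illustration, and only a single branch (one BHT entry) is modeled.

```python
def simulate_one_bit(outcomes, initial=True):
    """Count mispredictions of a single 1-bit BHT entry.

    outcomes: sequence of booleans, True = branch taken.
    initial:  starting prediction bit (assumed here to be "taken").
    """
    prediction = initial
    mispredictions = 0
    for taken in outcomes:
        if prediction != taken:
            mispredictions += 1
        prediction = taken  # 1-bit scheme: remember only the last outcome
    return mispredictions


# A loop-closing branch: taken 9 times, then not taken on exit.
one_pass = [True] * 9 + [False]

print(simulate_one_bit(one_pass))      # first execution of the loop
print(simulate_one_bit(one_pass * 3))  # loop executed three times in a row
```

With the entry initially predicting taken, the first execution of the loop mispredicts once (the exit). Every later execution mispredicts twice: once on re-entry, because the bit still says "not taken", and once on exit.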

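Branch History Table
[Figure: 1-bit prediction state diagram. State 1 predicts taken and state 0 predicts not taken; a taken branch (T) leads to state 1 and a not-taken branch (NT) leads to state 0. The BHT is indexed by the branch PC and updated with the branch outcome as feedback.]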

Dynamic Branch Prediction
- Solution: a 2-bit scheme in which the prediction changes only after mispredicting twice
- Adds hysteresis to the decision-making process
[Figure: 2-bit prediction state diagram with two Predict Taken states and two Predict Not Taken states, connected by taken (T) and not-taken (NT) transitions]
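The effect of the hysteresis can be seen on the same kind of loop trace. The sketch below assumes a 2-bit saturating counter, which is a common realization of the 2-bit scheme but may differ in its exact transitions from the state diagram on the slide. The name `simulate_two_bit` and the initial counter value are illustrative assumptions.

```python
def simulate_two_bit(outcomes, counter=3):
    """Count mispredictions of one 2-bit saturating counter.

    Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
    The starting value 3 (strongly taken) is an assumption.
    """
    mispredictions = 0
    for taken in outcomes:
        predict_taken = counter >= 2
        if predict_taken != taken:
            mispredictions += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return mispredictions


# Loop-closing branch: taken 9 times, then not taken on exit; three executions.
trace = ([True] * 9 + [False]) * 3
print(simulate_two_bit(trace))
```

Under these assumptions, only the loop exit is mispredicted in each execution. A single not-taken outcome moves the counter from 3 to 2, which still predicts taken when the loop is re-entered.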

BHT Accuracy
- Mispredict because either:
  - Wrong guess for that branch
  - Got the branch history of the wrong branch when indexing the table
- 4096 entry table:
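The second cause of misprediction (picking up another branch's history) follows from indexing the table with only the low bits of the PC and performing no address check. A minimal sketch, assuming 4-byte instructions so that the two byte-offset bits are dropped before indexing; the function name `bht_index` and the example addresses are made up for illustration.

```python
BHT_ENTRIES = 4096


def bht_index(pc, entries=BHT_ENTRIES):
    """Index a BHT with the low-order bits of the branch PC.

    Assumes 4-byte aligned instructions, so the low 2 bits carry no information.
    """
    return (pc >> 2) % entries


branch_a = 0x00400010                    # hypothetical branch address
branch_b = branch_a + BHT_ENTRIES * 4    # a different branch, 16 KiB further on

# Both branches map to the same entry and therefore share one prediction.
print(bht_index(branch_a), bht_index(branch_b))
```

Because there is no tag comparison, the predictor cannot tell the two branches apart; each one overwrites the history the other relies on.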

Correlating Branches
- Code example showing the potential
- Assembly code
[Figure: source and assembly listings, not preserved in this transcript]
- Observation: if BNEZ1 and BNEZ2 are both not taken, then BNEZ3 is taken

Correlating Branch Predictor
- Idea: the taken/not-taken outcome of recently executed branches is related to the behavior of the next branch (as well as to the history of that branch's own behavior)
  - The behavior of recent branches then selects between, say, 2 predictions of the next branch, updating just that prediction
- (1,1) predictor: 1-bit global, 1-bit local
[Figure: (1,1) predictor. The branch address (4 bits) indexes the 1-bit-per-branch local predictors; the 1-bit global branch history (0 = not taken) selects the prediction.]

Correlated Branch Prediction
- Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table
- In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters
  - Thus, the old 2-bit BHT is a (0,2) predictor
- Global Branch History: m-bit shift register keeping the T/NT status of the last m branches
- Each entry in the table has 2^m n-bit predictors
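The (m,n) scheme described above can be sketched in a few lines. This is an illustrative model, not the slides' implementation: the class name `CorrelatingPredictor`, the helper `count_mispredictions`, the use of saturating counters, the all-zero initial state, and the PC indexing (dropping two byte-offset bits) are all assumptions.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history select one of 2**m
    n-bit saturating counters in the entry indexed by the branch PC."""

    def __init__(self, m, n, entries):
        self.m, self.n, self.entries = m, n, entries
        self.history = 0                      # m-bit global shift register
        self.max_count = (1 << n) - 1
        # entries x 2**m counters, all starting at 0 (an assumption)
        self.table = [[0] * (1 << m) for _ in range(entries)]

    def _index(self, pc):
        return (pc >> 2) % self.entries       # assumes 4-byte instructions

    def predict(self, pc):
        counter = self.table[self._index(pc)][self.history]
        return counter > self.max_count // 2  # upper half predicts taken

    def update(self, pc, taken):
        row = self.table[self._index(pc)]
        c = row[self.history]
        row[self.history] = min(c + 1, self.max_count) if taken else max(c - 1, 0)
        if self.m:                            # shift the outcome into the history
            mask = (1 << self.m) - 1
            self.history = ((self.history << 1) | int(taken)) & mask

    def storage_bits(self):
        return (1 << self.m) * self.n * self.entries


def count_mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [True, False] * 10              # a branch that alternates T, NT
plain = CorrelatingPredictor(0, 2, 4096)      # the old 2-bit BHT
corr = CorrelatingPredictor(2, 2, 1024)       # a (2,2) predictor

print(plain.storage_bits(), corr.storage_bits())            # 8192 8192
print(count_mispredictions(plain, 0x400100, alternating))   # 10
print(count_mispredictions(corr, 0x400100, alternating))    # 3
```

Both configurations use the same number of state bits (2^m × n × entries). On this alternating pattern, the (0,2) predictor keeps missing every taken outcome, while the (2,2) predictor learns the pattern after a short warm-up because the global history separates the two cases.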

Correlating Branches
(2,2) predictor
- Behavior of recent branches selects between four predictions of the next branch, updating just that prediction
[Figure: (2,2) predictor. The branch address (4 bits) indexes the 2-bit-per-branch predictors; the 2-bit global branch history selects the prediction.]

Accuracy of Different Schemes
[Figure: bar chart of the frequency of mispredictions (axis from 0% to 20%) for three schemes: 4,096-entry 2-bit BHT, unlimited-entry 2-bit BHT, and 1,024-entry (2,2) BHT. Benchmarks: nasa7, matrix300, doduc, spice, fpppp, gcc, espresso, eqntott, li, tomcatv.]

Loop Example
  Loop: LD    F0, 0(R1)
        SD    F4, 0(R1)
        SUBI  R1, R1, #8
        MULTD F4, F0, F2
        BNEZ  F4, Loop
- Very long wait for branch?

spec·u·la·tion [spek-yuh-ley-shuhn]
1. the contemplation or consideration of some subject: to engage in speculation on humanity's ultimate destiny.
2. a single instance or process of consideration.
3. a conclusion or opinion reached by such contemplation: These speculations are impossible to verify.
4. conjectural consideration of a matter; conjecture or surmise: a report based on speculation rather than facts.
5. engagement in business transactions involving considerable risk but offering the chance of large gains, especially trading in commodities, stocks, etc., in the hope of profit from changes in the market price.

Speculation to greater ILP
- Greater ILP: overcome control dependence by hardware speculating on the outcome of branches and executing the program as if the guesses were correct
  - Dynamic scheduling: only fetches and issues instructions
  - Speculation: fetch, issue, and execute instructions as if branch predictions were always correct
- Essentially a data flow execution model: operations execute as soon as their operands are available

Speculation to greater ILP
3 components of HW-based speculation:
1. Dynamic branch prediction to choose which instructions to execute
2. Speculation to allow execution of instructions before control dependences are resolved, plus the ability to undo the effects of an incorrectly speculated sequence
3. Dynamic scheduling to deal with scheduling of different combinations of basic blocks

Adding Speculation to Tomasulo
- Must separate execution from allowing an instruction to finish or "commit"
- This additional step is called instruction commit
- When an instruction is no longer speculative, allow it to update the register file or memory
- Requires an additional set of buffers to hold results of instructions that have finished execution but have not committed
- This reorder buffer (ROB) is also used to pass results among instructions that may be speculated

Reorder Buffer (ROB)
- In Tomasulo's algorithm, once an instruction writes its result, any subsequently issued instructions will find the result in the register file
- With speculation, the register file is not updated until the instruction commits
  - (we know definitively that the instruction should execute)
- Thus, the ROB supplies operands in the interval between completion of instruction execution and instruction commit
  - ROB is a source of operands for instructions, just as reservation stations (RS) provide operands in Tomasulo's algorithm
  - ROB extends architecture registers like RS

Reorder Buffer Entry
Each entry in the ROB contains four fields:
1. Instruction type
   - a branch (has no destination result), a store (has a memory address destination), or a register operation (ALU operation or load, which has register destinations)
2. Destination
   - Register number (for loads and ALU operations) or memory address (for stores) where the instruction result should be written
3. Value
   - Value of the instruction result until the instruction commits
4. Ready
   - Indicates that the instruction has completed execution and the value is ready

Reorder Buffer operation
- Holds instructions in FIFO order, exactly as issued
- When instructions complete, results are placed into the ROB
  - Supplies operands to other instructions between execution complete & commit: more registers, like RS
  - Tag results with ROB buffer number instead of reservation station
- Instructions commit: values at head of ROB are placed in registers
- As a result, easy to undo speculated instructions on mispredicted branches or on exceptions
[Figure: FP Op Queue, Reorder Buffer, FP Regs, Res Stations, FP Adder, with the commit path from the Reorder Buffer to the FP Regs]

Steps of Speculative Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If a reservation station and a reorder buffer slot are free, issue instr & send operands & reorder buffer no. for destination (this stage is sometimes called "dispatch")
2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch CDB for result; when both are in the reservation station, execute; checks RAW (sometimes called "issue")
3. Write result: finish execution (WB)
   Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.
4. Commit: update register with reorder result
   When instr. is at head of reorder buffer & result is present, update register with result (or store to memory) and remove instr from reorder buffer. A mispredicted branch flushes the reorder buffer (sometimes called "graduation")
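The issue, write-result, and commit steps can be illustrated with a toy reorder buffer. This is a simplified sketch under stated assumptions, not the hardware design: reservation stations and the CDB are not modeled, the names `ROBEntry` and `ReorderBuffer` and the `mispredicted` flag are invented for illustration, and the flush happens when the mispredicted branch reaches the head of the buffer.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ROBEntry:
    kind: str                      # "branch", "store", or "reg"
    dest: Optional[str]            # register name or memory address; None for branches
    value: Optional[float] = None  # result, held here until commit
    ready: bool = False            # execution finished and value is valid
    mispredicted: bool = False     # illustrative flag for a wrongly predicted branch


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()     # head (oldest instruction) is on the left
        self.registers = {}        # architectural register file
        self.memory = {}

    def issue(self, kind, dest=None):
        """Allocate an entry at the tail; return None if the ROB is full."""
        if len(self.entries) == self.size:
            return None
        entry = ROBEntry(kind, dest)
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        entry.value, entry.ready, entry.mispredicted = value, True, mispredicted

    def commit(self):
        """Retire ready entries from the head only, i.e. in program order."""
        while self.entries and self.entries[0].ready:
            head = self.entries.popleft()
            if head.kind == "reg":
                self.registers[head.dest] = head.value
            elif head.kind == "store":
                self.memory[head.dest] = head.value
            elif head.mispredicted:
                self.entries.clear()   # flush all younger, speculative entries


rob = ReorderBuffer(size=7)
ld = rob.issue("reg", "F0")
add = rob.issue("reg", "F10")
br = rob.issue("branch")
spec = rob.issue("reg", "F4")      # issued past the unresolved branch

rob.write_result(spec, 1.5)        # finishes early, but is not at the head
rob.commit()
print(rob.registers)               # {} : nothing can commit yet

rob.write_result(ld, 2.0)
rob.write_result(add, 3.0)
rob.write_result(br, mispredicted=True)
rob.commit()
print(rob.registers)               # {'F0': 2.0, 'F10': 3.0} : F4 was never written
```

Because results reach the register file only at commit, undoing the speculative instruction amounts to discarding its ROB entry.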

Tomasulo With Reorder Buffer:
Reorder buffer (ROB1 to ROB7, oldest first):
  Entry  Dest  Value  Instruction      Done?
  ROB1   F0           LD F0,10(R2)     N
Load buffers (Dest, Address): 1, 10+R2
[Figure labels: FP Op Queue, Registers, Reorder Buffer, Reservation Stations, FP adders, FP multipliers, To Memory, from Memory]

Tomasulo With Reorder Buffer:
Reorder buffer (ROB1 to ROB7, oldest first):
  Entry  Dest  Value  Instruction      Done?
  ROB1   F0           LD F0,10(R2)     N
  ROB2   F10          ADDD F10,F4,F0   N
Reservation stations, FP adders (Dest, Operation): 2, ADDD R(F4),ROB1
Load buffers (Dest, Address): 1, 10+R2

Tomasulo With Reorder Buffer:
Reorder buffer (ROB1 to ROB7, oldest first):
  Entry  Dest  Value  Instruction      Done?
  ROB1   F0           LD F0,10(R2)     N
  ROB2   F10          ADDD F10,F4,F0   N
  ROB3   F2           DIVD F2,F10,F6   N
Reservation stations, FP adders (Dest, Operation): 2, ADDD R(F4),ROB1
Reservation stations, FP multipliers (Dest, Operation): 3, DIVD ROB2,R(F6)
Load buffers (Dest, Address): 1, 10+R2

Tomasulo With Reorder Buffer:
Reorder buffer (ROB1 to ROB7, oldest first):
  Entry  Dest  Value  Instruction      Done?
  ROB1   F0           LD F0,10(R2)     N
  ROB2   F10          ADDD F10,F4,F0   N
  ROB3   F2           DIVD F2,F10,F6   N
  ROB4   --           BNE F2,          N
  ROB5   F4           LD F4,0(R3)      N
  ROB6   F0           ADDD F0,F4,F6    N
Reservation stations, FP adders (Dest, Operation): 2, ADDD R(F4),ROB1; 6, ADDD ROB5,R(F6)
Reservation stations, FP multipliers (Dest, Operation): 3, DIVD ROB2,R(F6)
Load buffers (Dest, Address): 1, 10+R2; 5, 0+R3

Tomasulo With Reorder Buffer:
Reorder buffer (ROB1 to ROB7, oldest first):
  Entry  Dest  Value  Instruction      Done?
  ROB1   F0           LD F0,10(R2)     N
  ROB2   F10          ADDD F10,F4,F0   N
  ROB3   F2           DIVD F2,F10,F6   N
  ROB4   --           BNE F2,          N
  ROB5   F4           LD F4,0(R3)      N
  ROB6   F0           ADDD F0,F4,F6    N
  ROB7   --    ROB5   ST 0(R3),F4      N
Reservation stations, FP adders (Dest, Operation): 2, ADDD R(F4),ROB1; 6, ADDD ROB5,R(F6)
Reservation stations, FP multipliers (Dest, Operation): 3, DIVD ROB2,R(F6)
Load buffers (Dest, Address): 1, 10+R2; 5, 0+R3

Tomasulo With Reorder Buffer:
Reorder buffer (ROB1 to ROB7, oldest first):
  Entry  Dest  Value  Instruction      Done?
  ROB1   F0           LD F0,10(R2)     N
  ROB2   F10          ADDD F10,F4,F0   N
  ROB3   F2           DIVD F2,F10,F6   N
  ROB4   --           BNE F2,          N
  ROB5   F4    M[10]  LD F4,0(R3)      Y
  ROB6   F0           ADDD F0,F4,F6    N
  ROB7   --    M[10]  ST 0(R3),F4      Y
Reservation stations, FP adders (Dest, Operation): 2, ADDD R(F4),ROB1; 6, ADDD M[10],R(F6)
Reservation stations, FP multipliers (Dest, Operation): 3, DIVD ROB2,R(F6)
Load buffers (Dest, Address): 1, 10+R2

Tomasulo With Reorder Buffer:
Reorder buffer (ROB1 to ROB7, oldest first):
  Entry  Dest  Value  Instruction      Done?
  ROB1   F0           LD F0,10(R2)     N
  ROB2   F10          ADDD F10,F4,F0   N
  ROB3   F2           DIVD F2,F10,F6   N
  ROB4   --           BNE F2,          N
  ROB5   F4    M[10]  LD F4,0(R3)      Y
  ROB6   F0           ADDD F0,F4,F6    Y
  ROB7   --    M[10]  ST 0(R3),F4      Y
Reservation stations, FP adders (Dest, Operation): 2, ADDD R(F4),ROB1
Reservation stations, FP multipliers (Dest, Operation): 3, DIVD ROB2,R(F6)
Load buffers (Dest, Address): 1, 10+R2

Tomasulo With Reorder Buffer:
Reorder buffer (ROB1 to ROB7, oldest first):
  Entry  Dest  Value  Instruction      Done?
  ROB1   F0           LD F0,10(R2)     N
  ROB2   F10          ADDD F10,F4,F0   N
  ROB3   F2           DIVD F2,F10,F6   N
  ROB4   --           BNE F2,          N
  ROB5   F4    M[10]  LD F4,0(R3)      Y
  ROB6   F0           ADDD F0,F4,F6    Y
  ROB7   --    M[10]  ST 0(R3),F4      Y
Reservation stations, FP adders (Dest, Operation): 2, ADDD R(F4),ROB1
Reservation stations, FP multipliers (Dest, Operation): 3, DIVD ROB2,R(F6)
Load buffers (Dest, Address): 1, 10+R2

Getting CPI below 1
- CPI ≥ 1 if only 1 instruction is issued every clock cycle
- Multiple-issue processors come in 3 flavors:
  1. statically-scheduled superscalar processors,
  2. dynamically-scheduled superscalar processors, and
  3. VLIW (very long instruction word) processors
- The 2 types of superscalar processors issue varying numbers of instructions per clock
  - use in-order execution if they are statically scheduled, or
  - out-of-order execution if they are dynamically scheduled
- VLIW processors, in contrast, issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction (Intel/HP Itanium)
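The bound in the first bullet follows directly from the definition of CPI. As a brief sketch, let w denote the maximum number of instructions issued per clock cycle (a symbol introduced here for illustration, not used on the slide):

```latex
\mathrm{CPI} \;=\; \frac{\text{clock cycles}}{\text{instructions executed}} \;\ge\; \frac{1}{w},
\qquad\text{so}\qquad w = 1 \;\Rightarrow\; \mathrm{CPI} \ge 1 .
```

Only with w > 1, that is, with multiple issue, can the CPI drop below 1.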

Reading Assignment:
- Intel Core i7 prediction scheme
- Sections 3.3 & 3.6