CSE 431 Computer Architecture Fall 2005 Lecture 12: SS Front End (Fetch, Decode & Dispatch) Mary Jane Irwin (www.cse.psu.edu/~mji) www.cse.psu.edu/~cg431

CSE 431 Computer Architecture Fall 2005 Lecture 12: SS Front End (Fetch, Decode & Dispatch) Mary Jane Irwin (www.cse.psu.edu/~mji) www.cse.psu.edu/~cg431 [Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and Superscalar Microprocessor Design, Johnson, © 1992]

SS Pipeline
FETCH (in order): fetch multiple instructions.
DECODE & DISPATCH (in order): decode instructions; 1st RUU function (RUU allocation, src operand copying (or Tag field set), dst Tag field set). Note that RUU allocation happens in DECODE, just as ROB allocation does in a ROB-based design.
ISSUE & EXECUTE (out of order): 3rd RUU function (when both src operands are Ready and the FU is free, schedule the Result Bus and issue the instr for execution).
WRITE BACK RESULT (out of order): 2nd RUU function (copy Result Bus data to matching srcs and to the RUU dst entry).
COMMIT (in order): 4th RUU function (for the instr at RUU_Head (if executed), write the dst Contents to the RegFile).
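A minimal sketch of the 3rd RUU function's issue check (the entry and FU representations here are illustrative assumptions, not the course's exact structures): an instr may issue only when both src operands are Ready and its functional unit is free.

```python
# Sketch of the issue check (3rd RUU function): an RUU entry may issue
# when both src operands are Ready and its functional unit is free.

def can_issue(entry, fu_busy):
    """entry: dict with src readiness and an FU id; fu_busy: set of busy FUs."""
    return (entry["src1_ready"] and entry["src2_ready"]
            and entry["fu"] not in fu_busy)

ruu = [dict(pc=0x100, src1_ready=True, src2_ready=True, fu=0),
       dict(pc=0x104, src1_ready=True, src2_ready=False, fu=1)]
busy = {2}
issued = [e["pc"] for e in ruu if can_issue(e, busy)]
print([hex(pc) for pc in issued])   # only the first instr can issue
```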

Instruction Fetch Sequences
Instruction run: the number of instructions (the run length) fetched between taken branches. The instruction fetcher operates most efficiently when processing long runs; unfortunately, runs are usually quite short. Run length is determined only by taken branches (branches that are not taken are counted as part of a longer run), and taken branches are typically less than 20% of the dynamic instructions executed by RISCs. The average run length is about six instructions, and half of the instruction runs are four instructions or less (Figure 4-3 in Johnson). The slide's timing figure shows a 4-way instr fetcher delivering a 5-instruction sequential run (S1-S5) and, after a branch delay, a 4-instruction target run (T1-T4): 9 instructions in 8 cycles, an instruction bandwidth of only 1.125 instructions per cycle.
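To make the arithmetic concrete, here is a toy model (a sketch, not Johnson's exact experiment; the alignment behavior and delay parameter are assumptions) of how taken branches and misalignment cap fetch bandwidth:

```python
# Toy model of fetch bandwidth lost to taken branches and misalignment.
# Assumptions: a `width`-wide fetcher reads aligned blocks from the I$,
# wastes the slots before a run's start and after its taken branch, and
# pays a fixed branch delay (in cycles) on every taken branch.

def run_fetch_cycles(start_addr, run_len, width=4, branch_delay=1):
    """Cycles to fetch one run of run_len instrs starting at word
    address start_addr, using aligned width-instr I$ blocks."""
    first_block = start_addr // width
    last_block = (start_addr + run_len - 1) // width
    blocks = last_block - first_block + 1    # aligned fetches needed
    return blocks + branch_delay             # plus taken-branch penalty

run_len = 6                                  # average run length (slide)
cycles = run_fetch_cycles(start_addr=2, run_len=run_len)  # misaligned run
print(f"{run_len} instrs / {cycles} cycles = {run_len/cycles:.2f} instr/cycle")
```

Even this optimistic model stays well below the 4 instr/cycle peak; dynamic branch prediction (hiding the branch delay) and alignment/merging (reclaiming the wasted slots) attack exactly these two loss terms, as the next slide describes.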

Instruction Fetch Inefficiencies
The fetcher can't provide adequate bandwidth to the decoder to exploit the available ILP because:
The decoder is idle while the outcome of a branch is determined. This can (mostly) be fixed with dynamic branch prediction.
Instruction fetch misalignment prevents the decoder from operating at full capacity even when the decoder is processing valid instructions. The fetcher can align fetched instructions to avoid wasted decoder slots, and, if supported by dynamic branch prediction, it can also merge instructions from different runs.
Aligning and merging can only be done if the fetcher has sufficient bandwidth (i.e., the fetch rate is faster than the decode rate).
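A minimal sketch of the align-and-merge step (the data structures and helper names here are hypothetical): alignment drops the decoder slots before the run start, and merging tops the group up with instructions from the predicted target run.

```python
# Sketch of fetch alignment and merging (hypothetical structures).
# fetch_block: list of (addr, instr) pairs from one aligned I$ read.
# target_block: instructions from the predicted branch target, if any.

def align_and_merge(fetch_block, run_start, taken_branch_addr=None,
                    target_block=None, width=4):
    # Alignment: drop slots that precede the start of the current run.
    group = [(a, i) for (a, i) in fetch_block if a >= run_start]
    if taken_branch_addr is not None:
        # Discard slots after the predicted-taken branch ...
        group = [(a, i) for (a, i) in group if a <= taken_branch_addr]
        # ... and merge in instructions from the predicted target run.
        if target_block:
            group += target_block[:width - len(group)]
    return group[:width]    # at most `width` instrs go to the decoder
```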

Speedups of Fetch Alternatives
From Johnson, 1992 (speedup of the fetch alternatives). Base: no prediction and no alignment; Pred: dynamic branch prediction. A 4-way instr fetcher outperforms a 2-way instr fetcher: it has twice the potential instruction bandwidth, but it requires twice as much decoder hardware to keep up (e.g., in decoders and in ports and buses).

4-Way Decoder Implementation
A 4-way instr fetcher has higher fetch bandwidth, but at what cost?
12 dependency checks between the 4 decoded instructions (each instr's two source fields, rs and rt, must be compared against the rd destination field of every earlier instr in the group: 2 × (1 + 2 + 3) = 12)
8 read ports on the RegFile
8 write ports to the RUU and 8 buses to distribute those source operands (or their RegFile addr || LI)
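A minimal sketch of that dependency check (the instruction encodings here are hypothetical dicts with MIPS-style rs/rt/rd fields):

```python
# Sketch of the intra-group dependency check for a 4-wide decoder.

def intra_group_dependences(group):
    """Return (consumer_idx, src_field, producer_idx) for every source
    register written by an earlier instr in the same decode group."""
    deps, checks = [], 0
    for j, consumer in enumerate(group):
        for i in range(j):                      # all earlier instrs
            for field in ("rs", "rt"):
                checks += 1                     # one comparator per check
                if consumer[field] == group[i]["rd"]:
                    deps.append((j, field, i))
    return deps, checks

group = [dict(rs=1, rt=2, rd=3), dict(rs=3, rt=4, rd=5),
         dict(rs=5, rt=6, rd=7), dict(rs=1, rt=7, rd=8)]
deps, checks = intra_group_dependences(group)
print(deps)     # e.g. instr 1's rs reads the r3 produced by instr 0
print(checks)   # 12 comparisons for a 4-instr group, as on the slide
```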

Reducing 4-Way Decoder Hardware
Limiting the number of RegFile read ports, buses, and RUU write ports is acceptable since:
Not all decoded instructions access two registers
Not all decoded instructions are valid (because of misalignment)
Some decoded instructions have dependences on one or more simultaneously decoded instructions
Johnson's measurements (1992; a histogram of the fraction of total decoded parcels vs. the number of ports used) show that register demand is low enough that an 8-port capacity would be wasted: reducing to 4 ports lowers average performance by less than 2%. Halving the resource (read ports and buses) and arbitrating for it thus shrinks the hardware while increasing its utilization.

Arbitrating Read Port and Bus Usage
If only 4 read ports and 4 buses are provided, we have to determine which instrs get first access to them. If more than 4 ports are needed to dispatch the decoded instructions, then instruction fetch and decode must stall. Prioritized register-identifier selection for port usage must be accomplished within about half a processor cycle. If the instr in the first decoder position (i.e., the first instr in program order) requires register access, it is always enabled on the first port (and on the second, if it has two source operands); arbitration continues in sequence for the instrs in the second, third, and fourth decoder positions.
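A minimal sketch of that priority scheme (the representation of the decode group is an assumption):

```python
# Sketch of prioritized read-port arbitration: ports are granted in
# program order; if the group needs more ports than exist, the ungranted
# instrs cause fetch/decode to stall until the next cycle.

def arbitrate_ports(group, num_ports=4):
    """group: per-instr source-register lists, in program order.
    Returns (grants, stall): grants[i] is the ports given to instr i."""
    grants, next_port, stall = [], 0, False
    for srcs in group:
        got = []
        for _ in srcs:
            if next_port < num_ports:
                got.append(next_port)
                next_port += 1
            else:
                stall = True    # not enough ports: later instrs wait
        grants.append(got)
    return grants, stall

# Four instrs needing 2, 2, 1, 2 source reads: 7 > 4 ports, so stall.
print(arbitrate_ports([[1, 2], [3, 4], [5], [6, 7]]))
```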

SS Branch Prediction
Recall from Lecture 8 that for branch prediction in a scalar pipeline we needed:
A mechanism to predict the branch outcome: a BHT (branch history table) in the fetch stage
A way to fetch two instructions: the sequential instruction (from the I$) and the branch target instruction (from the BTB (branch target buffer))
A way to ensure that instructions active in the pipeline following the branch don't change the machine state until the branch outcome is known: they are allowed to complete (in-order commit) on a correct prediction, and flushed (with restart) on a mispredict
With a SS machine, it is possible to have many such instructions after predicted branches active in the pipeline, so instructions following branches are flagged as speculative until the branch outcome is known.
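For reference, a minimal sketch of the classic 2-bit saturating-counter BHT (the table size and indexing here are illustrative assumptions, not the course's specific design):

```python
# Minimal 2-bit saturating-counter BHT sketch.
# Counter states: 0,1 predict not-taken; 2,3 predict taken.

class BHT:
    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [1] * entries       # start weakly not-taken

    def _index(self, pc):
        return (pc >> 2) % self.entries     # low-order PC bits

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bht = BHT()
bht.update(0x400, taken=True)
bht.update(0x400, taken=True)
print(bht.predict(0x400))   # True after two taken outcomes
```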

Implementing Branches
A SS processor could have more than one branch per fetch set and could have several uncompleted branches pending at any time. Pipeline: Fetch (I$ and BHT/BTB access), Decode (predict), Dispatch, Branch (check prediction).
Must access the BHT/BTB for all branch instrs in the fetch set during fetch to reduce the branch delay (i.e., need a 4-read-port BHT/BTB for a 4-instr fetcher)
Pass the BHT information to the decode stage
After decode, choose between the I$ set and the BTB set to determine the next fetch set
Note that the BTB contains branch target instructions (NOT branch target addresses), along with the PC address of the branch. That way during decode we can check that the BTB entry is for the RIGHT branch (since the small BTB is accessed by the low-order bits of the PC during fetch), assuming the prediction is taken (if not, the sequential instructions just fetched from the I$ are the right ones to decode next). If the BTB entry is for the wrong branch, the BTB instr set is thrown away and fetch is reissued at the branch target address.
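A minimal sketch of such a BTB (the entry layout, sizes, and method names are assumptions): the fetch-stage lookup uses only low-order PC bits, and the decode-stage check compares the stored branch PC to catch aliases.

```python
# Sketch of a BTB that stores branch target *instructions* plus the
# branch PC as a tag, as the slide describes.

class BTB:
    def __init__(self, entries=64):
        self.entries = entries
        self.table = {}  # index -> (branch_pc, target_pc, target_instrs)

    def _index(self, pc):
        return (pc >> 2) % self.entries     # low-order PC bits only

    def insert(self, branch_pc, target_pc, target_instrs):
        self.table[self._index(branch_pc)] = (branch_pc, target_pc,
                                              target_instrs)

    def lookup(self, pc):
        """Fetch-stage access: fast, but may alias to the wrong branch."""
        return self.table.get(self._index(pc))

    def is_right_branch(self, entry, pc):
        """Decode-stage tag check: is this entry for the RIGHT branch?
        If not, throw the BTB instr set away and refetch at the target."""
        return entry is not None and entry[0] == pc

btb = BTB()
btb.insert(0x1004, 0x2000, ["T1", "T2", "T3", "T4"])
print(btb.is_right_branch(btb.lookup(0x1004), 0x1004))          # True
aliased_pc = 0x1004 + 64 * 4                                    # same index
print(btb.is_right_branch(btb.lookup(aliased_pc), aliased_pc))  # False
```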

Decoding & Dispatching Branches
While multiple branches could be dispatched per cycle, imposing a decoder limit of one branch per fetch set incurs only a slight performance decrease (about 2%, from Johnson, 1992), since typically only one branch per cycle can be executed anyway (usually there is only one branch FU). Having minimum branch delay is more important than decoding multiple branches per cycle. This is the first mention of the branch-prediction list; later slides will show that we keep the branches in the ROB so we know when they are the next instruction to commit (and that gives us the in-program-order list of branches).
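A minimal sketch of that decoder limit (the group representation is hypothetical): the decode group is cut just before a second branch, which waits for the next cycle.

```python
# Sketch of the one-branch-per-fetch-set decoder limit: decoding stops
# just before a second branch; the deferred instrs go in the next cycle.

def limit_one_branch(group, is_branch):
    out, seen_branch = [], False
    for instr in group:
        if is_branch(instr) and seen_branch:
            break            # second branch: defer it to the next cycle
        out.append(instr)
        seen_branch = seen_branch or is_branch(instr)
    return out

print(limit_one_branch(["add", "beq", "sub", "bne"],
                       lambda i: i in ("beq", "bne")))  # cut before 'bne'
```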

Speculative Instructions
Speculation: the processor (or compiler) guesses the outcome of an instruction (e.g., branches, loads) so as to enable execution of other instructions that depend on the speculated instruction. It is one of the most important methods for finding more ILP in SS and VLIW processors. Producing correct results requires checking, recovery, and restart hardware mechanisms:
Checking mechanisms to see if the prediction was correct
Recovery mechanisms to cancel the effects of instructions that were issued under false assumptions (e.g., a branch misprediction)
Restart mechanisms to reestablish the correct instruction sequence
For branches, the correct program counter restart value is known when the branch outcome is determined.

RUU Speculation Field Support
For dependent speculative instrs, the speculative flag is set to Yes until the outcome of the driving instr (i.e., the branch) is determined. Then an associative comparison of that branch's PC addr against the RUU entries' SIA (Spec Instr Addr) fields can be done. The slide's RUU entry diagram, flattened here:
functional unit: Unit Number
issued: Yes/No
executed: Yes/No
speculative: Yes/No, plus Spec Instr Addr (SIA)
src operand 1: Tag, Ready, Content
src operand 2: Tag, Ready, Content
destination: Tag, Content
PC: Address
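A minimal sketch of that entry layout as a data structure (a Python dataclass; the grouping of fields is an assumption), including the associative clear of the speculative flag when the driving branch resolves as correctly predicted:

```python
# Sketch of an RUU entry with the speculation fields from the slide.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Operand:
    tag: Optional[int] = None   # producer's RUU tag, if not yet ready
    ready: bool = False
    content: int = 0            # operand value once ready

@dataclass
class RUUEntry:
    functional_unit: int = 0    # unit number
    issued: bool = False
    executed: bool = False
    speculative: bool = False   # Yes until the driving branch resolves
    spec_instr_addr: int = 0    # SIA: PC of the branch speculated on
    src1: Operand = field(default_factory=Operand)
    src2: Operand = field(default_factory=Operand)
    dst: Operand = field(default_factory=Operand)
    pc: int = 0

def branch_resolved(ruu, branch_pc):
    """Associative compare of branch_pc against every entry's SIA
    (as on the slide); clear the flag on a correct prediction."""
    for e in ruu:
        if e.speculative and e.spec_instr_addr == branch_pc:
            e.speculative = False
```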

Branch Execution
If the branch was not mispredicted, then the branch and its trailing instructions can commit when at the RUU_Head. If the branch was mispredicted, then all subsequent instrs must be discarded (even though subsequent branches may have been correctly predicted). When there is an exception, all of the RUU entries are discarded in a single cycle and instruction-stream fetching restarts on the next cycle. Thus, the RUU provides an easy way to discard instructions coming after a mispredicted branch. (There is only one branch unit.)
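A minimal sketch of the two recovery paths (reusing the hypothetical RUUEntry above; the RUU is modeled as a simple list in dispatch order):

```python
# Mispredict/exception recovery sketch. `ruu` is a list of RUUEntry
# objects (sketched above) in dispatch (program) order.

def squash_after_mispredict(ruu, branch_pc):
    """Keep entries up to and including the mispredicted branch; all
    instrs dispatched after it are discarded (even correctly predicted
    later branches go with them)."""
    for i, entry in enumerate(ruu):
        if entry.pc == branch_pc:
            return ruu[:i + 1]
    return ruu

def flush_on_exception(ruu):
    """An exception discards ALL RUU entries in a single cycle;
    fetch restarts on the next cycle (as on the slide)."""
    return []
```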

Effects of RUU Size on Performance
Since instruction decoding must stall when there is no free RUU entry, the RUU should be large enough to accept all instructions during the expected dispatch-to-commit time period. Johnson's data (1992; speedup vs. number of RUU (IROB) entries with a 4-instr decoder) shows that performance decreases markedly with 8 and 4 entries, and with 2 entries performance is worse than the scalar processor. Why? The scalar processor does not suffer the additional delay of writing results to the RUU before they are written to the RegFile (the SS pipeline has one additional stage).
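A minimal sketch of the dispatch-side stall (a fixed-capacity list is an assumption; a real RUU is a circular buffer with head and tail pointers):

```python
# Dispatch stall sketch: decode stalls when the RUU is full.

RUU_SIZE = 16   # assumed capacity, for illustration

def try_dispatch(ruu, entry):
    """Allocate an RUU entry at dispatch; return False to stall decode."""
    if len(ruu) >= RUU_SIZE:
        return False        # no free entry: fetch/decode must stall
    ruu.append(entry)       # 1st RUU function: allocation at dispatch
    return True
```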

Effects of LSQ Size on Performance
Since instruction decoding must stall when there is no free LSQ entry, the LSQ should be large enough, but LSQ size has relatively little impact on performance. With a 4-way decoder, a 4-entry LSQ incurs only a 1% speedup loss relative to an 8-entry LSQ (from Johnson, 1992). A smaller LSQ also facilitates dependency checking.

SS Fetch and Decode Pipeline Stages
FETCH (in order): fetch 4 instructions; access the BHT/BTB for PC, PC+4, PC+8, PC+12; align and merge.
DECODE & DISPATCH (in order): decode 4 instructions; if a branch, use the BHT info to determine the next PC value and whether the I$ or BTB instrs are the next fetch set; 1st RUU function (RUU allocation, src operand copying (or Tag field set), dst Tag field set). DECODE does the RUU allocation, just as it does ROB allocation.
ISSUE & EXECUTE (out of order): 3rd RUU function (when both src operands are Ready and the FU is free, schedule the Result Bus and issue the instr for execution).

Next Lecture and Reminders
Next lecture: MIPS superscalar back end (instruction execute and commit). Reading assignment: Sohi paper; Johnson Chapters 6, 7 & 8 (optional).
Reminders:
HW3, Part 1 due October 13th (due IN CLASS)
HW3, Part 2 (the SimpleScalar experiments) due Friday, October 21st (email solution to the TA, rdas@cse.psu.edu, by 5:00pm)
Evening midterm exam scheduled (one week away): Tuesday, October 18th, 20:15 to 22:15, Location 113 IST. Only one student has contacted me about a conflict (you know who you are), so that is the only conflict exam that will be scheduled.