ECE/CS 552: Introduction to Superscalar Processors and the MIPS R10000 © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill,

Slides:



Advertisements
Similar presentations
Computer Organization and Architecture
Advertisements

1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.
Pipeline Computer Organization II 1 Hazards Situations that prevent starting the next instruction in the next cycle Structural hazards – A required resource.
THE MIPS R10000 SUPERSCALAR MICROPROCESSOR Kenneth C. Yeager IEEE Micro in April 1996 Presented by Nitin Gupta.
Superscalar Organization Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti.
Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)
CPE 731 Advanced Computer Architecture ILP: Part IV – Speculative Execution Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
Computer Architecture 2011 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)
CS 152 Computer Architecture and Engineering Lecture 15 - Advanced Superscalars Krste Asanovic Electrical Engineering and Computer Sciences University.
Trace Caches J. Nelson Amaral. Difficulties to Instruction Fetching Where to fetch the next instruction from? – Use branch prediction Sometimes there.
March 9, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Krste Asanovic Electrical.
1 Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections )
The PowerPC Architecture  IBM, Motorola, and Apple Alliance  Based on the IBM POWER Architecture ­Facilitate parallel execution ­Scale well with advancing.
1 Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections )
ECE/CS 552: Introduction to Superscalar Processors Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison Lecture notes partially based.
Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.
1 Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91) Please recall:  Multicycle instructions lead to the requirement.
1 Advanced Computer Architecture Dynamic Instruction Level Parallelism Lecture 2.
Dynamic Pipelines. Interstage Buffers Superscalar Pipeline Stages In Program Order In Program Order Out of Order.
1 Lecture 7: Speculative Execution and Recovery Branch prediction and speculative execution, precise interrupt, reorder buffer.
1 Lecture: Out-of-order Processors Topics: a basic out-of-order processor with issue queue, register renaming, and reorder buffer.
ECE/CS 552: Pipeline Hazards © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim.
Samira Khan University of Virginia Feb 9, 2016 COMPUTER ARCHITECTURE CS 6354 Precise Exception The content and concept of this course are adapted from.
1 Lecture 10: Memory Dependence Detection and Speculation Memory correctness, dynamic memory disambiguation, speculative disambiguation, Alpha Example.
CS203 – Advanced Computer Architecture ILP and Speculation.
Use of Pipelining to Achieve CPI < 1
Lecture: Out-of-order Processors
CS 352H: Computer Systems Architecture
Dynamic Scheduling Why go out of style?
CSL718 : Superscalar Processors
William Stallings Computer Organization and Architecture 8th Edition
/ Computer Architecture and Design
PowerPC 604 Superscalar Microprocessor
Part IV Data Path and Control
Out of Order Processors
CS203 – Advanced Computer Architecture
Lecture: Out-of-order Processors
Chapter 14 Instruction Level Parallelism and Superscalar Processors
Pipeline Implementation (4.6)
Flow Path Model of Superscalars
Instruction Level Parallelism and Superscalar Processors
Lecture 6: Advanced Pipelines
Out of Order Processors
Superscalar Processors & VLIW Processors
Lecture 10: Out-of-order Processors
Lecture 11: Out-of-order Processors
Lecture: Out-of-order Processors
Module 3: Branch Prediction
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
Lecture 11: Memory Data Flow Techniques
Lecture: Out-of-order Processors
Lecture 8: Dynamic ILP Topics: out-of-order processors
Adapted from the slides of Prof
15-740/ Computer Architecture Lecture 5: Precise Exceptions
Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)
Control unit extension for data hazards
* From AMD 1996 Publication #18522 Revision E
Adapted from the slides of Prof
Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011
CSC3050 – Computer Architecture
Dynamic Hardware Prediction
Created by Vivi Sahfitri
Dynamic Pipelines Like Wendy’s: once ID/RD has determined what you need, you get queued up, and others behind you can get past you. In-order front end,
Lecture 10: ILP Innovations
Lecture 9: Dynamic ILP Topics: out-of-order processors
Conceptual execution on a processor which exploits ILP
ECE/CS 552: Introduction to Superscalar Processors
Presentation transcript:

ECE/CS 552: Introduction to Superscalar Processors and the MIPS R10000 © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith

Superscalar Processors Challenges & Impediments – Instruction flow – Register data flow – Memory data flow MIPS R10000 reading – [Kenneth C. Yeager. The MIPS R10000 Superscalar Microprocessor, IEEE Micro, April 1996] – Exemplifies the techniques described in this lecture 2

Limitations of Scalar Pipelines Scalar upper bound on throughput – IPC = 1 Inefficient unified pipeline – Long latency for each instruction Rigid pipeline stall policy – One stalled instruction stalls all newer instructions 3

Parallel Pipelines 4

Intel Pentium Parallel Pipeline 5

Diversified Pipelines 6

IBM Power4 Diversified Pipelines 7 PC I-Cache BR Scan BR Predict Fetch Q Decode Reorder Buffer BR/CR Issue Q CR Unit BR Unit FX/LD 1 Issue Q FX1 Unit LD1 Unit FX/LD 2 Issue Q LD2 Unit FX2 Unit FP Issue Q FP1 Unit FP2 Unit StQ D-Cache

Rigid Pipeline Stall Policy 8 Bypassing of Stalled Instruction Stalled Instruction Backward Propagation of Stalling Not Allowed

Dynamic Pipelines 9

Superscalar Pipeline Stages 10 In Program Order In Program Order Out of Order

Limitations of Scalar Pipelines Scalar upper bound on throughput – IPC = 1 – Solution: wide (superscalar) pipeline Inefficient unified pipeline – Long latency for each instruction – Solution: diversified, specialized pipelines Rigid pipeline stall policy – One stalled instruction stalls all newer instructions – Solution: Out-of-order execution, distributed execution pipelines 11

Impediments to High IPC 12

Instruction Flow Challenges: – Branches: control dependences – Branch target misalignment – Instruction cache misses Solutions – Alignment hardware – Prediction/speculation – Nonblocking cache 13 Instruction Memory PC 3 instructions fetched Objective: Fetch multiple instructions per cycle

Disruption of Sequential Control Flow 14

Branch Prediction Target address generation  Target Speculation – Access register: PC, General purpose register, Link register – Perform calculation: +/- offset, autoincrement Condition resolution  Condition speculation – Access register: Condition code register, General purpose register – Perform calculation: Comparison of data register(s) 15

Target Address Generation 16

Condition Resolution 17

Branch Instruction Speculation 18

Static Branch Prediction Single-direction – Always not-taken: Intel i486 Backwards Taken/Forward Not Taken – Loop-closing branches have negative offset – Used as backup in Pentium Pro, II, III, 4 19

Static Branch Prediction Profile-based 1.Instrument program binary 2.Run with representative (?) input set 3.Recompile program a.Annotate branches with hint bits, or b.Restructure code to match predict not-taken Performance: 75-80% accuracy – Much higher for “easy” cases 20

Dynamic Branch Prediction Main advantages: – Learn branch behavior autonomously No compiler analysis, heuristics, or profiling – Adapt to changing branch behavior Program phase changes branch behavior First proposed in 1980 – US Patent #4,370,711, Branch predictor using random access memory, James. E. Smith Continually refined since then 21

Smith Predictor Hardware Jim E. Smith. A Study of Branch Prediction Strategies. International Symposium on Computer Architecture, pages , May 1981 Widely employed: Intel Pentium, PowerPC 604, MIPS R10000, etc. More sophisticated predictors used in more recent designs – Covered in detail in

Branch Target Prediction Does not work well for function/procedure returns – Special return address stack (RAS) structure covered in

Branch Speculation Leading Speculation – Typically done during the Fetch stage – Based on potential branch instruction(s) in the current fetch group Trailing Confirmation – Typically done during the Branch Execute stage – Based on the next Branch instruction to finish execution 24

Branch Speculation Start new correct path – Must remember the alternate (non-predicted) path Eliminate incorrect path – Must ensure that the mis-speculated instructions produce no side effects 25

Mis-speculation Recovery Start new correct path 1.Update PC with computed branch target (if predicted NT) 2.Update PC with sequential instruction address (if predicted T) 3.Can begin speculation again at next branch Eliminate incorrect path 1.Use tag(s) to deallocate resources occupied by speculative instructions 2.Invalidate all instructions in the decode and dispatch buffers, as well as those in reservation stations 26

Parallel Decode Primary Tasks – Identify individual instructions (!) – Determine instruction types – Determine dependences between instructions Two important factors – Instruction set architecture – Pipeline width 27

Pentium Pro Fetch/Decode 28

Dependence Checking Trailing instructions in fetch group – Check for dependence on leading instructions 29 DestSrc0Src1DestSrc0Src1DestSrc0Src1DestSrc0Src1 ?=

Summary: Instruction Flow Fetch group alignment Target address generation – Branch target buffer Target condition generation – Static prediction – Dynamic prediction Speculative execution – Tagging/tracking instructions – Recovering from mispredicted branches Parallel decode 30

Register Data Flow Parallel pipeline – Centralized instruction fetch – Centralized instruction decode Diversified pipeline – Distributed instruction execution Data dependence linking – Register renaming to eliminate WAR/WAW – Issue logic to support out-of-order issue – Reorder buffer to maintain precise state 31

MIPS R10000 Issue Queues 32

Register Data Dependences Program data dependences cause hazards – True dependences (RAW) – Antidependences (WAR) – Output dependences (WAW) When are registers read and written? – Out of program order! – Hence, any/all of these can occur Solution to all three: register renaming 33

Register Renaming: WAR/WAW Widely employed (Core i7, MIPS R10000, …) Resolving WAR/WAW: – Each register write gets unique “rename register” – Writes are committed in program order at Writeback – WAR and WAW are not an issue All updates to “architected state” delayed till writeback Writeback stage always later than read stage – Reorder Buffer (ROB) enforces in-order writeback 34 Add R3 <= …P32 <= … Sub R4 <= …P33 <= … And R3 <= …P35 <= …

Register Renaming: RAW In order, at dispatch: – Source registers checked to see if “in flight” Register map table keeps track of this If not in flight, can be read from the register file If in flight, look up “rename register” tag (IOU) – Then, allocate new register for register write 35 Add R3 <= R2 + R1P32 <= P2 + P1 Sub R4 <= R3 + R1P33 <= P32 + P1 And R3 <= R4 & R2P35 <= P33 + P2

Register Renaming: RAW Advance instruction to instruction queue – Wait for rename register tag to trigger issue Issue queue/reservation station enables out- of-order issue – Newer instructions can bypass stalled instructions 36

Physical Register File Used in MIPS R10000, Core i7, AMD Bulldozer All registers in one place – Always accessed right before EX stage – No copying to real register file at commit 37 Fetch Decode Rename Issue RF Read Execute Agen-D$ ALU RF Write D$ Load RF Write Physical Register File Map Table R0 => P7 R1 => P3 … R31 => P39

Managing Physical Registers What to do when all physical registers are in use? – Must release them somehow to avoid stalling – Maintain free list of “unused” physical registers Release when no more uses are possible – Sufficient: next write commits 38 Map Table R0 => P7 R1 => P3 … R31 => P39 Add R3 <= R2 + R1P32 <= P2 + P1 Sub R4 <= R3 + R1P33 <= P32 + P1 … And R3 <= R4 & R2P35 <= P33 + P2 Release P32 (previous R3) when this instruction completes execution

Memory Data Flow Resolve WAR/WAW/RAW memory dependences – MEM stage can occur out of order Provide high bandwidth to memory hierarchy – Non-blocking caches 39

Memory Data Dependences WAR/WAW: stores commit in order – Hazards not possible. Why? RAW: loads must check pending stores – Store queue keeps track of pending store addresses – Loads check against these addresses – Similar to register bypass logic – Comparators are 64 bits wide (address size) Major source of complexity in modern designs – Store queue lookup is position-based – What if store address is not yet known? 40 Store Queue Load/Store RS Agen Reorder Buffer Mem

Increasing Memory Bandwidth 41

Maintaining Precise State Out-of-order execution – ALU instructions – Load/store instructions In-order completion/retirement – Precise exceptions Solutions – Reorder buffer retires instructions in order – Store queue retires stores in order – Exceptions can be handled at any instruction boundary by reconstructing state out of ROB/SQ 42

A Dynamic Superscalar Processor 43

MIPS R10000 Details in paper [Kenneth C. Yeager. The MIPS R10000 Superscalar Microprocessor, IEEE Micro, April 1996] Clean, straightforward implementation of many of the concepts presented in this lecture 44

Superscalar Summary 45