ECE/CS 757: Advanced Computer Architecture II Instructor: Mikko H. Lipasti Spring 2013 University of Wisconsin-Madison Lecture notes based on slides created by John Shen, Mark Hill, David Wood, Guri Sohi, Jim Smith, Natalie Enright Jerger, Michel Dubois, Murali Annavaram, Per Stenström and probably others

Review of 752 Iron law Beyond pipelining Superscalar challenges Instruction flow Register data flow Memory data flow Modern memory interface

Iron Law
Processor Performance = Time/Program
  = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)
  = (code size) x (CPI) x (cycle time)
Architecture --> Implementation --> Realization
(Compiler Designer --> Processor Designer --> Chip Designer)

Iron Law Instructions/Program – Instructions executed, not static code size – Determined by algorithm, compiler, ISA Cycles/Instruction – Determined by ISA and CPU organization – Overlap among instructions reduces this term Time/cycle – Determined by technology, organization, clever circuit design

Our Goal Minimize time, which is the product, NOT isolated terms Common error to miss terms while devising optimizations – E.g. ISA change to decrease instruction count – BUT leads to CPU organization which makes clock slower Bottom line: terms are inter-related
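To make that pitfall concrete, here is a minimal sketch (all numbers hypothetical) of how an instruction-count "win" can still lose on total time:

```python
def exec_time_s(insns, cpi, cycle_ns):
    """Iron law: time = instructions x CPI x cycle time."""
    return insns * cpi * cycle_ns / 1e9

# Hypothetical baseline: 1B dynamic instructions, CPI 1.5, 1.0 ns cycle.
base = exec_time_s(1e9, 1.5, 1.0)       # 1.50 s
# ISA "optimization": 10% fewer instructions, but a 15% slower clock.
new = exec_time_s(0.9e9, 1.5, 1.15)     # ~1.55 s: slower overall
print(f"baseline {base:.2f} s, 'optimized' {new:.2f} s")
```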

Pipelined Design Motivation: – Increase throughput with little increase in hardware. Bandwidth or Throughput = Performance Bandwidth (BW) = no. of tasks/unit time For a system that operates on one task at a time: – BW = 1/delay (latency) BW can be increased by pipelining if many operands exist which need the same operation, i.e. many repetitions of the same task are to be performed. Latency required for each task remains the same or may even increase slightly.

Ideal Pipelining Bandwidth increases linearly with pipeline depth Latency increases by latch delays

Example: Integer Multiplier 16x16 combinational multiplier ISCAS-85 C6288 standard benchmark Tools: Synopsys DC/LSI Logic 110nm gflxp ASIC [Source: J. Hayes, Univ. of Michigan]

Example: Integer Multiplier

Configuration    Delay     MPS           Area (FF/wiring)    Area Increase
Combinational    3.52 ns   --            -- (--/1759)        --
2 Stages         1.87 ns   534 (1.9x)    8725 (1078/1870)    16%
4 Stages         1.17 ns   855 (3.0x)    11276 (3388/2112)   50%
8 Stages         0.80 ns   1250 (4.4x)   17127 (8938/2612)   127%

Pipeline efficiency: 2-stage: nearly double throughput, marginal area cost; 4-stage: 75% efficiency, area still reasonable; 8-stage: 55% efficiency, area more than doubles. Tools: Synopsys DC/LSI Logic 110nm gflxp ASIC
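The efficiency bullets follow directly from the table; a quick sketch of the arithmetic:

```python
combinational_delay_ns = 3.52   # from the table above

for stages, delay_ns in [(2, 1.87), (4, 1.17), (8, 0.80)]:
    speedup = combinational_delay_ns / delay_ns
    efficiency = speedup / stages   # fraction of ideal linear speedup
    print(f"{stages} stages: {speedup:.1f}x throughput, "
          f"{efficiency:.0%} of ideal")
```

Latch overhead is what keeps the 8-stage design at 4.4x rather than 8x, and the flip-flop area (FF column) is what more than doubles the total area.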

Pipelining Idealisms Uniform subcomputations – Can pipeline into stages with equal delay – Balance pipeline stages Identical computations – Can fill pipeline with identical work – Unify instruction types Independent computations – No relationships between work units – Minimize pipeline stalls Are these practical? – No, but can get close enough to get significant speedup

Instruction Pipelining The “computation” to be pipelined. – Instruction Fetch (IF) – Instruction Decode (ID) – Operand(s) Fetch (OF) – Instruction Execution (EX) – Operand Store (OS) – Update Program Counter (PC)

Generic Instruction Pipeline Based on “obvious” subcomputations

Pipelining Idealisms Uniform subcomputations – Can pipeline into stages with equal delay – Balance pipeline stages Identical computations – Can fill pipeline with identical work – Unify instruction types (example in 752 notes) Independent computations – No relationships between work units – Minimize pipeline stalls

Program Dependences

Program Data Dependences True dependence (RAW) – j cannot execute until i produces its result Anti-dependence (WAR) – j cannot write its result until i has read its sources Output dependence (WAW) – j cannot write its result until i has written its result

Control Dependences Conditional branches – Branch must execute to determine which instruction to fetch next – Instructions following a conditional branch are control dependent on the branch instruction

Resolution of Pipeline Hazards Pipeline hazards – Potential violations of program dependences – Must ensure program dependences are not violated Hazard resolution – Static: compiler/programmer guarantees correctness – Dynamic: hardware performs checks at runtime Pipeline interlock – Hardware mechanism for dynamic hazard resolution – Must detect and enforce dependences at runtime

IBM RISC Experience [Agerwala and Cocke 1987] Internal IBM study: Limits of a scalar pipeline? Memory bandwidth – Fetch 1 instr/cycle from I-cache – 40% of instructions are load/store (D-cache) Code characteristics (dynamic) – Loads 25% – Stores 15% – ALU/RR 40% – Branches 20%: 1/3 unconditional (always taken), 1/3 conditional taken, 1/3 conditional not taken

IBM Experience Cache performance – Assume 100% hit ratio (upper bound) – Cache latency: I = D = 1 cycle default Load and branch scheduling – Loads: 25% cannot be scheduled (delay slot empty), 65% can be moved back 1 or 2 instructions, 10% can be moved back 1 instruction – Branches: unconditional – 100% schedulable (fill one delay slot); conditional – 50% schedulable (fill one delay slot)

CPI Optimizations Goal and impediments – CPI = 1, prevented by pipeline stalls No cache bypass of RF, no load/branch scheduling – Load penalty: 2 cycles: 0.25 x 2 = 0.5 CPI – Branch penalty: 2 cycles: 0.2 x 2/3 x 2 = 0.27 CPI – Total CPI: 1 + 0.5 + 0.27 = 1.77 CPI Bypass, no load/branch scheduling – Load penalty: 1 cycle: 0.25 x 1 = 0.25 CPI – Total CPI: 1 + 0.25 + 0.27 = 1.52 CPI

More CPI Optimizations Bypass, scheduling of loads/branches – Load penalty: 65% + 10% = 75% moved back, no penalty; 25% => 1 cycle penalty: 0.25 x 0.25 x 1 = 0.0625 CPI – Branch penalty: 1/3 unconditional, 100% schedulable => 1 cycle; 1/3 cond. not-taken => no penalty (predict not-taken); 1/3 cond. taken, 50% schedulable => 1 cycle; 1/3 cond. taken, 50% unschedulable => 2 cycles: 0.2 x [1/3 x 1 + 1/3 x 0.5 x 1 + 1/3 x 0.5 x 2] = 0.167 Total CPI: 1 + 0.0625 + 0.167 = 1.23 CPI
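A short sketch reproducing the CPI arithmetic of the last two slides (instruction mix and penalties as given above):

```python
def total_cpi(load_penalty, branch_penalty):
    # 25% loads, 20% branches; base CPI of 1.
    return 1.0 + 0.25 * load_penalty + 0.20 * branch_penalty

# No bypass, no scheduling: 2-cycle load penalty; 2/3 of branches pay 2.
print(total_cpi(2.0, (2/3) * 2))                            # ~1.77
# Bypassing cuts the load penalty to 1 cycle.
print(total_cpi(1.0, (2/3) * 2))                            # ~1.52
# Scheduling: 25% of loads still pay 1 cycle; branch breakdown as above.
print(total_cpi(0.25 * 1, 1/3*1 + 1/3*0.5*1 + 1/3*0.5*2))   # ~1.23
```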

Simplify Branches Assume 90% can be PC-relative – No register indirect, no register access – Separate adder (like MIPS R3000) – Branch penalty reduced Total CPI: 1.15 CPI = 0.87 IPC

PC-relative    Schedulable    Penalty
Yes (90%)      Yes (50%)      0 cycles
Yes (90%)      No (50%)       1 cycle
No (10%)       Yes (50%)      1 cycle
No (10%)       No (50%)       2 cycles

15% overhead from program dependences

Limits of Pipelining IBM RISC Experience – Control and data dependences add 15% – Best case CPI of 1.15, IPC of 0.87 – Deeper pipelines (higher frequency) magnify dependence penalties This analysis assumes 100% cache hit rates – Hit rates approach 100% for some programs – Many important programs have much worse hit rates

Processor Performance In the 1980s (decade of pipelining): – CPI: 5.0 => 1.15 In the 1990s (decade of superscalar): – CPI: 1.15 => 0.5 (best case) In the 2000s (decade of multicore): – Core CPI unchanged; chip CPI scales with # of cores
Processor Performance = Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle) = (code size) x (CPI) x (cycle time)

Limits on Instruction Level Parallelism (ILP)
Weiss and Smith [1984]: 1.58
Sohi and Vajapeyam [1987]: 1.81
Tjaden and Flynn [1970]: 1.86 (Flynn's bottleneck)
Tjaden and Flynn [1973]: 1.96
Uht [1986]: 2.00
Smith et al. [1989]: 2.00
Jouppi and Wall [1988]: 2.40
Johnson [1991]: 2.50
Acosta et al. [1986]: 2.79
Wedig [1982]: 3.00
Butler et al. [1991]: 5.8
Melvin and Patt [1991]: 6
Wall [1991]: 7 (Jouppi disagreed)
Kuck et al. [1972]: 8
Riseman and Foster [1972]: 51 (no control dependences)
Nicolau and Fisher [1984]: 90 (Fisher's optimism)

Superscalar Proposal Go beyond single instruction pipeline, achieve IPC > 1 Dispatch multiple instructions per cycle Provide more generally applicable form of concurrency (not just vectors) Geared for sequential code that is hard to parallelize otherwise Exploit fine-grained or instruction-level parallelism (ILP)

Limitations of Scalar Pipelines Scalar upper bound on throughput – IPC = 1 Inefficient unified pipeline – Long latency for each instruction Rigid pipeline stall policy – One stalled instruction stalls all newer instructions

Parallel Pipelines

Power4 Diversified Pipelines [diagram: PC and I-Cache with BR scan and BR predict feed a fetch queue and decode; a reorder buffer tracks instructions; separate issue queues (BR/CR, FX/LD 1, FX/LD 2, FP) feed the CR, BR, FX1, LD1, LD2, FX2, FP1, and FP2 units, with a store queue (StQ) in front of the D-Cache]

Rigid Pipeline Stall Policy [diagram: a stalled instruction blocks all trailing instructions; stalling propagates backward because bypassing of the stalled instruction is not allowed]

Dynamic Pipelines

Limitations of Scalar Pipelines Scalar upper bound on throughput – IPC = 1 – Solution: wide (superscalar) pipeline Inefficient unified pipeline – Long latency for each instruction – Solution: diversified, specialized pipelines Rigid pipeline stall policy – One stalled instruction stalls all newer instructions – Solution: Out-of-order execution, distributed execution pipelines

High-IPC Processor Evolution
Desktop/Workstation market:
– Scalar RISC pipeline, 1980s: MIPS, SPARC, Intel 486
– 2-issue in-order, early 1990s: IBM RIOS-I, Intel Pentium
– Limited out-of-order, mid 1990s: PowerPC 604, Intel P6
– Large ROB out-of-order, 2000s: DEC Alpha, IBM Power4/5, AMD K8
– 1985–2005: 20 years, 100x frequency
Mobile market:
– Scalar RISC pipeline, 2002: ARM11
– 2-issue in-order, 2005: Cortex A8
– Limited out-of-order, 2009: Cortex A9
– Large ROB out-of-order, 2011: Cortex A15
– 2002–2011: 10 years, 10x frequency

Superscalar Overview Instruction flow – Branches, jumps, calls: predict target, direction – Fetch alignment – Instruction cache misses Register data flow – Register renaming: RAW/WAR/WAW Memory data flow – In-order stores: WAR/WAW – Store queue: RAW – Data cache misses

High-IPC Processor

Goal and Impediments Goal of instruction flow – Supply processor with maximum number of useful instructions every clock cycle Impediments – Branches and jumps – Finite I-cache capacity – Bandwidth restrictions

Limits on Instruction Level Parallelism (ILP)
Weiss and Smith [1984]: 1.58
Sohi and Vajapeyam [1987]: 1.81
Tjaden and Flynn [1970]: 1.86 (Flynn's bottleneck)
Tjaden and Flynn [1973]: 1.96
Uht [1986]: 2.00
Smith et al. [1989]: 2.00
Jouppi and Wall [1988]: 2.40
Johnson [1991]: 2.50
Acosta et al. [1986]: 2.79
Wedig [1982]: 3.00
Butler et al. [1991]: 5.8
Melvin and Patt [1991]: 6
Wall [1991]: 7 (Jouppi disagreed)
Kuck et al. [1972]: 8
Riseman and Foster [1972]: 51 (no control dependences)
Nicolau and Fisher [1984]: 90 (Fisher's optimism)

Speculative Execution Riseman & Foster showed potential – But no idea how to reap benefit 1979: Jim Smith patents branch prediction at Control Data – Predict current branch based on past history Today: virtually all processors use branch prediction

Instruction Flow Objective: fetch multiple instructions per cycle Challenges: – Branches: unpredictable – Branch targets misaligned – Instruction cache misses Solutions – Prediction and speculation – High-bandwidth fetch logic – Nonblocking cache and prefetching [diagram: instruction cache indexed by PC; a misaligned fetch group yields only 3 instructions fetched]

Disruption of Instruction Flow

Branch Prediction Target address generation => target speculation – Access register: PC, general purpose register, link register – Perform calculation: +/- offset, autoincrement Condition resolution => condition speculation – Access register: condition code register, general purpose register – Perform calculation: comparison of data register(s)

Target Address Generation

Branch Condition Resolution

Branch Instruction Speculation

Smith Predictor Hardware Jim E. Smith, "A Study of Branch Prediction Strategies," International Symposium on Computer Architecture, pages 135–148, May 1981. Widely employed: Intel Pentium, PowerPC 604, MIPS R10000, etc.
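A minimal sketch of a Smith-style predictor: a table of 2-bit saturating counters indexed by low-order PC bits (the table size and 4-byte instruction alignment are assumptions for illustration):

```python
class SmithPredictor:
    def __init__(self, entries=1024):
        self.mask = entries - 1
        self.counters = [2] * entries          # initialize weakly taken

    def _index(self, pc):
        return (pc >> 2) & self.mask           # drop byte offset, then mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # True means taken

    def update(self, pc, taken):
        i = self._index(pc)
        delta = 1 if taken else -1
        self.counters[i] = min(3, max(0, self.counters[i] + delta))
```

The second bit provides hysteresis: a loop-closing branch mispredicts once per loop exit instead of twice, since a single not-taken outcome only drops a saturated counter from 3 to 2.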

Cortex A15: Bi-Mode Predictor PHT partitioned into T/NT halves – Selector chooses source Reduces negative interference, since most entries in PHT0 tend towards NT and most entries in PHT1 tend towards T 15% of A15 core power!
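A sketch of the bi-mode idea: two direction PHTs, one biased not-taken and one biased taken, with a choice table picking the bank per branch. The sizes, the PC-xor-history index, and the selector update filter below follow the standard bi-mode scheme and are illustrative, not the A15's actual design:

```python
class BiModePredictor:
    def __init__(self, entries=1024):
        self.mask = entries - 1
        self.pht = ([1] * entries, [2] * entries)  # PHT0 biased NT, PHT1 biased T
        self.choice = [2] * entries                # per-branch bank selector
        self.history = 0                           # global branch history

    def _pht_index(self, pc):
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        bank = 1 if self.choice[(pc >> 2) & self.mask] >= 2 else 0
        return self.pht[bank][self._pht_index(pc)] >= 2

    def update(self, pc, taken):
        i, c = self._pht_index(pc), (pc >> 2) & self.mask
        bank = 1 if self.choice[c] >= 2 else 0
        correct = (self.pht[bank][i] >= 2) == taken
        delta = 1 if taken else -1
        self.pht[bank][i] = min(3, max(0, self.pht[bank][i] + delta))
        # Filtered selector update: leave the selector alone when the
        # chosen bank was right but the outcome points at the other bank.
        if not (correct and (bank == 1) != taken):
            self.choice[c] = min(3, max(0, self.choice[c] + delta))
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

Separating T-biased and NT-biased entries is what reduces the negative interference noted above: branches with opposite biases no longer train the same counters in opposite directions.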

Branch Target Prediction Does not work well for function/procedure returns Does not work well for virtual functions, switch statements

Branch Speculation Leading Speculation – Done during the Fetch stage – Based on potential branch instruction(s) in the current fetch group Trailing Confirmation – Done during the Branch Execute stage – Based on the next Branch instruction to finish execution

Branch Speculation Start new correct path – Must remember the alternate (non-predicted) path Eliminate incorrect path – Must ensure that the mis-speculated instructions produce no side effects

Mis-speculation Recovery Start new correct path 1. Update PC with computed branch target (if predicted NT) 2. Update PC with sequential instruction address (if predicted T) 3. Can begin speculation again at next branch Eliminate incorrect path 1. Use tag(s) to deallocate resources occupied by speculative instructions 2. Invalidate all instructions in the decode and dispatch buffers, as well as those in reservation stations

Parallel Decode Primary Tasks – Identify individual instructions (!) – Determine instruction types – Determine dependences between instructions Two important factors – Instruction set architecture – Pipeline width

Pentium Pro Fetch/Decode

Dependence Checking Trailing instructions in fetch group – Check for dependence on leading instructions [diagram: each instruction's (Dest, Src0, Src1) fields compared (?=) against the destinations of all leading instructions in the group]
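A sketch of what those comparators compute, with registers encoded as small integers purely for illustration:

```python
def raw_dependences(group):
    """group: fetch group as (dest, src0, src1) register tuples,
    in program order. Returns (leading, trailing) index pairs."""
    deps = []
    for j in range(1, len(group)):            # each trailing instruction
        for i in range(j):                    # against each leading one
            if group[i][0] in (group[j][1], group[j][2]):
                deps.append((i, j))
    return deps

# add r3<=r2,r1 ; sub r4<=r3,r1 ; and r3<=r4,r2
print(raw_dependences([(3, 2, 1), (4, 3, 1), (3, 4, 2)]))  # [(0, 1), (1, 2)]
```

Note the cost implied by the nested loop: an N-wide group needs on the order of N squared source-versus-destination comparators, one reason wide decode is expensive.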

Summary: Instruction Flow Fetch group alignment Target address generation – Branch target buffer Branch condition prediction Speculative execution – Tagging/tracking instructions – Recovering from mispredicted branches Decoding in parallel

High-IPC Processor

Register Data Flow Parallel pipelines – Centralized instruction fetch – Centralized instruction decode Diversified execution pipelines – Distributed instruction execution Data dependence linking – Register renaming to resolve true/false dependences – Issue logic to support out-of-order issue – Reorder buffer to maintain precise state

Issue Queues and Execution Lanes [figure: ARM Cortex A15; source: theregister.co.uk]

Program Data Dependences True dependence (RAW) – j cannot execute until i produces its result Anti-dependence (WAR) – j cannot write its result until i has read its sources Output dependence (WAW) – j cannot write its result until i has written its result

Register Data Dependences Program data dependences cause hazards – True dependences (RAW) – Antidependences (WAR) – Output dependences (WAW) When are registers read and written? – Out of program order! – Hence, any and all of these can occur Solution to all three: register renaming

Register Renaming: WAR/WAW Widely employed (Core i7, Cortex A15, …) Resolving WAR/WAW: – Each register write gets unique "rename register" – Writes are committed in program order at Writeback – WAR and WAW are not an issue All updates to "architected state" delayed till writeback Writeback stage always later than read stage – Reorder Buffer (ROB) enforces in-order writeback
Add R3 <= …  ->  P32 <= …
Sub R4 <= …  ->  P33 <= …
And R3 <= …  ->  P35 <= …

Register Renaming: RAW In order, at dispatch: – Source registers checked to see if "in flight" Register map table keeps track of this If not in flight, can be read from the register file If in flight, look up "rename register" tag (IOU) – Then, allocate new register for register write
Add R3 <= R2 + R1  ->  P32 <= P2 + P1
Sub R4 <= R3 + R1  ->  P33 <= P32 + P1
And R3 <= R4 & R2  ->  P35 <= P33 & P2

Register Renaming: RAW Advance instruction to instruction queue – Wait for rename register tag to trigger issue Issue queue/reservation station enables out-of-order issue – Newer instructions can bypass stalled instructions [figure: source: theregister.co.uk]

Physical Register File Used in the MIPS R10000 pipeline, Intel Sandy Bridge/Ivy Bridge All registers in one place – Always accessed right before EX stage – No copying to real register file [pipeline diagram: Fetch -> Decode -> Rename -> Issue -> RF Read -> Execute (ALU, or Agen + D$ for loads) -> RF Write; map table: R0 => P7, R1 => P3, …, R31 => P39]

Managing Physical Registers What to do when all physical registers are in use? – Must release them somehow to avoid stalling – Maintain free list of "unused" physical registers Release when no more uses are possible – Sufficient: next write commits
Map table: R0 => P7, R1 => P3, …, R31 => P39
Add R3 <= R2 + R1  ->  P32 <= P2 + P1
Sub R4 <= R3 + R1  ->  P33 <= P32 + P1
…
And R3 <= R4 & R2  ->  P35 <= P33 & P2  (release P32, the previous R3, when this instruction completes execution)
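A minimal dispatch-time sketch combining the map table and free list. The physical register numbering is illustrative; a real renamer also handles checkpointing, multiple destinations, and commit-time freeing:

```python
class Renamer:
    def __init__(self, arch_regs=32, phys_regs=64):
        self.map = list(range(arch_regs))              # R_i -> P_i initially
        self.free = list(range(arch_regs, phys_regs))  # unused physical regs

    def rename(self, dest, src0, src1):
        p0, p1 = self.map[src0], self.map[src1]  # RAW: read current mappings
        prev = self.map[dest]                    # freed when the new write commits
        pd = self.free.pop(0)                    # allocate rename register
        self.map[dest] = pd
        return pd, p0, p1, prev

r = Renamer()
print(r.rename(3, 2, 1))  # Add R3 <= R2 + R1 -> (32, 2, 1, 3):  P32 <= P2 + P1
print(r.rename(4, 3, 1))  # Sub R4 <= R3 + R1 -> (33, 32, 1, 4): P33 <= P32 + P1
print(r.rename(3, 4, 2))  # And R3 <= R4 & R2 -> (34, 33, 2, 32): P32 freed at commit
```

The slide's elided "…" instruction would allocate P34 in between, which is why the And receives P35 in the example above.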

High-IPC Processor

Memory Data Flow Resolve WAR/WAW/RAW memory dependences – MEM stage can occur out of order Provide high bandwidth to memory hierarchy – Non-blocking caches

Memory Data Dependences Besides branches, long memory latencies are one of the biggest performance challenges today. To preserve sequential (in-order) state in the data caches and external memory (so that recovery from exceptions is possible), stores are performed in order. This takes care of antidependences and output dependences to memory locations. However, loads can be issued out of order with respect to stores if the out-of-order loads check for data dependences with respect to previous, pending stores.
WAW: store X … store X
WAR: load X … store X
RAW: store X … load X

Memory Data Dependences "Memory aliasing" = two memory references involving the same memory location (collision of two memory addresses). "Memory disambiguation" = determining whether two memory references will alias or not (whether there is a dependence or not). Memory dependence detection: – Must compute effective addresses of both memory references – Effective addresses can depend on run-time data and other instructions – Comparison of addresses requires much wider comparators Example code:
(1) store V
(2) add
(3) load W
(4) load X
(5) load V   (RAW on V with (1))
(6) add
(7) store W  (WAR on W with (3))

Memory Data Dependences WAR/WAW: stores commit in order – Hazards not possible RAW: loads must check pending stores – Store queue keeps track of pending store addresses – Loads check against these addresses – Similar to register bypass logic – Comparators are 32 or 64 bits wide (address size) Major source of complexity in modern designs – Store queue lookup is position-based – What if store address is not yet known? Stall all trailing ops [diagram: load/store RS -> Agen -> store queue -> Mem, tracked by the reorder buffer]
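A sketch of the position-based lookup. The list-of-tuples store queue is illustrative; None models a store whose address has not been generated yet:

```python
def sq_lookup(store_queue, load_addr):
    """store_queue: pending stores, oldest first, as (addr or None, data)."""
    for addr, data in reversed(store_queue):   # youngest older store first
        if addr is None:
            return "stall"                     # unknown address: be conservative
        if addr == load_addr:
            return data                        # RAW: forward store data
    return "load from cache"

sq = [(0x800A, 7), (0x4000, 9)]
print(sq_lookup(sq, 0x800A))                   # 7 (forwarded)
print(sq_lookup(sq, 0x1234))                   # load from cache
print(sq_lookup([(None, 5)] + sq, 0x1234))     # stall
```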

Optimizing Load/Store Disambiguation Non-speculative load/store disambiguation 1. Loads wait for addresses of all prior stores 2. Full address comparison 3. Bypass if no match, forward if match (1) can limit performance:
load r5, MEM[r3]    <- cache miss
store r7, MEM[r5]   <- RAW for agen, stalled
…
load r8, MEM[r9]    <- independent load, stalled

Speculative Disambiguation What if aliases are rare? 1. Loads don't wait for addresses of all prior stores 2. Full address comparison of stores that are ready 3. Bypass if no match, forward if match 4. Check all store addresses when they commit – No matching loads: speculation was correct – Matching unbypassed load: incorrect speculation 5. Replay starting from incorrect load [diagram: load queue and store queue beside the load/store RS -> Agen -> Mem, tracked by the reorder buffer]

Speculative Disambiguation: Load Bypass
i1: st R3, MEM[R8]: x800A
i2: ld R9, MEM[R4]: x400A
i1 and i2 issue in program order (addresses initially unknown); i2 checks store queue (no match)

Speculative Disambiguation: Load Forward
i1: st R3, MEM[R8]: x800A
i2: ld R9, MEM[R4]: x800A
i1 and i2 issue in program order; i2 checks store queue (match => forward)

Speculative Disambiguation: Safe Speculation
i1: st R3, MEM[R8]: x800A
i2: ld R9, MEM[R4]: x400C
i1 and i2 issue out of program order; i1 checks load queue at commit (no match)

Speculative Disambiguation: Violation
i1: st R3, MEM[R8]: x800A
i2: ld R9, MEM[R4]: x800A
i1 and i2 issue out of program order; i1 checks load queue at commit (match) – i2 marked for replay
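The commit-time check behind these four cases, sketched with a load queue of (sequence number, address, forwarded?) entries; the encoding is illustrative:

```python
def store_commit_check(load_queue, store_seq, store_addr):
    """Return sequence numbers of younger loads that must replay."""
    return [seq for seq, addr, forwarded in load_queue
            if seq > store_seq and addr == store_addr and not forwarded]

lq = [(2, 0x800A, False)]                      # i2 executed early, no forward
print(store_commit_check(lq, 1, 0x800A))       # [2]: violation, replay from i2
print(store_commit_check(lq, 1, 0x400C))       # []: speculation was safe
```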

Use of Prediction If aliases are rare: static prediction – Predict no alias every time Why even implement forwarding? PowerPC 620 doesn’t – Pay misprediction penalty rarely If aliases are more frequent: dynamic prediction – Use PHT-like history table for loads If alias predicted: delay load If aliased pair predicted: forward from store to load – More difficult to predict pair [store sets, Alpha 21264] – Pay misprediction penalty rarely Memory cloaking [Moshovos, Sohi] – Predict load/store pair – Directly copy store data register to load target register – Reduce data transfer latency to absolute minimum

Load/Store Disambiguation Discussion RISC ISA: – Many registers, most variables allocated to registers – Aliases are rare – Most important to not delay loads (bypass) – Alias predictor may/may not be necessary CISC ISA: – Few registers, many operands from memory – Aliases much more common, forwarding necessary – Incorrect load speculation should be avoided – If load speculation allowed, predictor probably necessary Address translation: – Can’t use virtual address (must use physical) – Wait till after TLB lookup is done – Or, use subset of untranslated bits (page offset) Safe for proving inequality (bypassing OK) Not sufficient for showing equality (forwarding not OK)
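A sketch of the page-offset trick from the last bullet, assuming 4 KB pages so the low 12 bits are identical in virtual and physical addresses:

```python
PAGE_OFFSET_MASK = 0xFFF      # low 12 bits are untranslated

def offset_check(load_vaddr, store_vaddr):
    if (load_vaddr & PAGE_OFFSET_MASK) != (store_vaddr & PAGE_OFFSET_MASK):
        return "provably different: bypass OK"
    return "may alias: wait for translated addresses before forwarding"

print(offset_check(0x1234, 0x5678))   # offsets differ -> safe to bypass
print(offset_check(0x1234, 0x9234))   # same offset -> inconclusive
```

Equal offsets are inconclusive because two different pages can map the same offset, which is exactly why this comparison can prove inequality but never equality.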

The Memory Bottleneck

Increasing Memory Bandwidth [figure annotations: extra ports are expensive to duplicate; nonblocking caches require complex, concurrent FSMs]

Coherent Memory Interface

Load Queue – Tracks inflight loads for aliasing, coherence Store Queue – Defers stores until commit, tracks aliasing Storethrough Queue or Write Buffer or Store Buffer – Defers stores, coalesces writes, must handle RAW MSHR – Tracks outstanding misses, enables lockup-free caches [Kroft, ISCA 1981] Snoop Queue – Buffers, tracks incoming requests from coherent I/O, other processors Fill Buffer – Works with MSHR to hold incoming partial lines Writeback Buffer – Defers writeback of evicted line (demand miss handled first)

Split Transaction Bus “Packet switched” vs. “circuit switched” Release bus after request issued Allow multiple concurrent requests to overlap memory latency Complicates control, arbitration, and coherence protocol – Transient states for pending blocks (e.g. “req. issued but not completed”)

Memory Consistency How are memory references from different processors interleaved? If this is not well-specified, synchronization becomes difficult or even impossible – ISA must specify consistency model Common example using Dekker’s algorithm for synchronization – If load reordered ahead of store (as we assume for a baseline OOO CPU) – Both Proc0 and Proc1 enter critical section, since both observe that other’s lock variable (A/B) is not set If consistency model allows loads to execute ahead of stores, Dekker’s algorithm no longer works – Common ISAs allow this: IA-32, PowerPC, SPARC, Alpha
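The pattern in question, sketched in straight-line form (flags A and B as on the slide; Python itself executes these in order, so this only illustrates what the hardware might reorder):

```python
A = B = 0

def proc0():
    global A
    A = 1              # store my lock flag
    return B == 0      # load the other flag; enter critical section if 0

def proc1():
    global B
    B = 1              # store my lock flag
    return A == 0      # load the other flag; enter critical section if 0

# If the hardware hoists each load above the corresponding store, both
# procs can observe 0 and both enter the critical section together.
```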

Sequential Consistency [Lamport 1979] Processors treated as if they are interleaved processes on a single time-shared CPU All references must fit into a total global order or interleaving that does not violate any CPU’s program order – Otherwise sequential consistency not maintained Now Dekker’s algorithm will work Appears to preclude any OOO memory references – Hence precludes any real benefit from OOO CPUs

High-Performance Sequential Consistency Coherent caches isolate CPUs if no sharing is occurring – Absence of coherence activity means CPU is free to reorder references Still have to order references with respect to misses and other coherence activity (snoops) Key: use speculation – Reorder references speculatively – Track which addresses were touched speculatively – Force replay (in order execution) of such references that collide with coherence activity (snoops)

High-Performance Sequential Consistency Load queue records all speculative loads Bus writes/upgrades are checked against LQ Any matching load gets marked for replay At commit, loads are checked and replayed if necessary – Results in machine flush, since load-dependent ops must also replay Practically, conflicts are rare, so expensive flush is OK

Maintaining Precise State Out-of-order execution – ALU instructions – Load/store instructions In-order completion/retirement – Precise exceptions Solutions – Reorder buffer retires instructions in order – Store queue retires stores in order – Exceptions can be handled at any instruction boundary by reconstructing state out of ROB/SQ – Load queue monitors remote stores [diagram: ROB with head and tail pointers]

Superscalar Summary

[figure: John DeVale & Bryan Black, 2005]

Review of 752 Iron law Beyond pipelining Superscalar challenges Instruction flow Register data flow Memory data flow Modern memory interface What was not covered – Memory hierarchy (review later) – Virtual memory (read 4.4 in book) – Power & reliability (read ch. 2 in book) – Many implementation/design details – Etc. Multithreading (coming up next)