CDA 5155 Out-of-order execution: Pentium Pro/II/III Week 7.

Executing IA32 instructions fast
Problem: Complex instruction set
Solution: Break instructions up into RISC-like micro-operations
–Lengthens decode stage; simplifies execute

Pentium Pro/II/III Process Stages
The first stage consists of instruction fetch, decode, conversion into micro-ops, and register renaming
The reorder buffer (ROB) is the buffer between the first and second stages
The ROB is also the buffer between the second and third stages
The third stage retires the micro-operations in original program order
–Completed micro-operations wait in the reorder buffer until all of the preceding instructions have been retired

Pentium Pro pipeline overview
(Figure: pipeline diagram with IF, ID, REN, Alloc, EX, MEM, ROB, and CT; in-order and any-order regions; ROB Head and Tail pointers; ARF; Rename Table mapping regID to robIDX.)
Fetch (2 cycles)
–Read instructions (16 bytes) from memory at IP
Decode (3 cycles)
–Decode up to 3 instructions, generating up to 6 µops
–Decoder can handle 2 “simple” instructions and 1 “complex” instruction
Rename (1 cycle)
–Index table with source operand regID to locate its value in the ROB or ARF
Alloc
–Allocate ROB entry at Tail
Rename Table
–Indexed with regID
–Returns (valid, robIDX)
–If valid, ROB does/will contain value of register
–If invalid, ARF holds value (no instruction in flight defines this register)
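The rename table lookup described above can be sketched in a few lines. This is a minimal illustration, not the Pentium Pro's actual logic: the class and method names (`RenameTable`, `lookup`, `define`) and the 16-register size are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RenameEntry:
    valid: bool = False            # True: the ROB does/will hold the newest value
    rob_idx: Optional[int] = None  # ROB entry of the in-flight producer


class RenameTable:
    """Hypothetical rename table, indexed with regID."""

    def __init__(self, num_regs: int):
        self.entries = [RenameEntry() for _ in range(num_regs)]

    def lookup(self, reg_id: int):
        """Return where a source operand lives: ('ROB', robIDX) or ('ARF', regID)."""
        e = self.entries[reg_id]
        return ("ROB", e.rob_idx) if e.valid else ("ARF", reg_id)

    def define(self, reg_id: int, rob_idx: int):
        """Record that the instruction allocated at rob_idx will write reg_id."""
        self.entries[reg_id] = RenameEntry(True, rob_idx)


rt = RenameTable(num_regs=16)
before = rt.lookup(6)    # no instruction in flight defines r6: read the ARF
rt.define(6, rob_idx=3)  # an instruction allocated at ROB[3] writes r6
after = rt.lookup(6)     # later readers of r6 are pointed at ROB[3]
```

Here `before` is `("ARF", 6)` and `after` is `("ROB", 3)`, matching the valid/invalid cases listed for the Rename Table.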

Pentium Pro pipeline
(Figure: pipeline diagram with IF, ID, REN, Alloc, EX, MEM, ROB, and CT; in-order and any-order regions; ROB Head and Tail pointers; ARF. Each ROB entry holds PC, Dst regID, Dst value, Except?)
Execute (parallel)
–Wait for sources (schedule)
–Execute instruction (ex)
–Write back result to ROB
Commit
–Wait until Head is done
–If fault, initiate handler
–Else, write results to ARF
–Deallocate entry from ROB
Reorder Buffer (ROB)
–Circular queue of speculative state
–May contain multiple definitions of same register
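The ROB's circular-queue behavior and the commit rules above can be sketched as follows. This is a simplified model under stated assumptions: one commit per call, a dictionary standing in for the ARF, and hypothetical names (`ReorderBuffer`, `alloc`, `writeback`, `commit`) that do not come from the slides.

```python
class ReorderBuffer:
    """Hypothetical circular ROB: allocate at Tail, retire from Head."""

    def __init__(self, size):
        self.size = size
        self.entries = [None] * size
        self.head = 0
        self.tail = 0
        self.count = 0

    def alloc(self, pc, dst_reg):
        """Allocate an entry at Tail; return its index, or None if the ROB is full."""
        if self.count == self.size:
            return None
        idx = self.tail
        self.entries[idx] = {"pc": pc, "dst": dst_reg, "value": None,
                             "done": False, "fault": False}
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return idx

    def writeback(self, idx, value, fault=False):
        """Execution finished (in any order): record the result or the fault."""
        entry = self.entries[idx]
        entry["value"], entry["fault"], entry["done"] = value, fault, True

    def commit(self, arf):
        """Try to retire the Head entry; return 'stall', 'fault', or 'retired'."""
        if self.count == 0 or not self.entries[self.head]["done"]:
            return "stall"                 # wait until Head is done
        entry = self.entries[self.head]
        if entry["fault"]:
            return "fault"                 # a real core would initiate the handler
        arf[entry["dst"]] = entry["value"]  # write result to the ARF
        self.entries[self.head] = None      # deallocate the entry
        self.head = (self.head + 1) % self.size
        self.count -= 1
        return "retired"


rob, arf = ReorderBuffer(4), {}
a = rob.alloc(pc=0x100, dst_reg=6)
b = rob.alloc(pc=0x104, dst_reg=8)
rob.writeback(b, 7)        # the younger instruction finishes first
first = rob.commit(arf)    # Head (a) is not done yet
rob.writeback(a, 5)
second = rob.commit(arf)   # a retires
third = rob.commit(arf)    # then b retires, in program order
```

The younger instruction completes first but cannot retire until the Head entry is done, which is the in-order commit property the slide describes.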

Register Renaming Example
Logical Program      Physical Program
r6 = r5 + r2         p52 = p45 + p42
r8 = r6 + r3         p53 = p52 + r3
r6 = r9 + r10        p54 = r9 + r10
r12 = r8 + r6        p55 = p53 + p
(Figure: register map table snapshots after each rename step.)
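The renaming steps in this example can be replayed with a short sketch. The starting map (r2 in p42, r5 in p45) and the first free register (p52) are read off the example; treating unmapped sources such as r3, r9, and r10 as keeping their architectural names mirrors how the slide writes them. The function name `rename` and the tuple encoding are assumptions made for illustration.

```python
def rename(program, initial_map, first_free):
    """Rename (dst, src1, src2) tuples in program order.

    Returns the renamed tuples and the final logical-to-physical map.
    """
    table = dict(initial_map)
    next_free = first_free
    renamed = []
    for dst, src1, src2 in program:
        # Sources read the current mapping; unmapped registers keep
        # their architectural name, as in the slide's example.
        p1 = table.get(src1, src1)
        p2 = table.get(src2, src2)
        # Every destination gets a fresh physical register.
        pd = f"p{next_free}"
        next_free += 1
        table[dst] = pd
        renamed.append((pd, p1, p2))
    return renamed, table


program = [
    ("r6", "r5", "r2"),
    ("r8", "r6", "r3"),
    ("r6", "r9", "r10"),
    ("r12", "r8", "r6"),
]
renamed, final_map = rename(program, {"r2": "p42", "r5": "p45"}, first_free=52)
for pd, p1, p2 in renamed:
    print(f"{pd} = {p1} + {p2}")
```

The second write to r6 receives a different physical register than the first, so the second instruction still reads the first definition (p52) while the fourth reads the newer one. Under these assumptions the sketch maps the final r6 operand to p54.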

Cross-cutting Issue: Misspeculation
What are the impacts of misspeculation or exceptions?
–When instructions are flushed from the pipeline, rename mappings must be restored to the point of restart
–Otherwise, new instructions will see stale definitions
Two recovery approaches
–Simple/slow
  1. Wait until the faulting/mispredicting instruction reaches retirement
  2. Flush ALL speculative register definitions by clearing all rename table valid bits
–Complex/fast
  1. Checkpoint the ENTIRE rename table anywhere recovery may be needed
  2. As soon as misspeculation is detected, recover the table associated with the PC
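The two recovery approaches can be contrasted in a small sketch. It assumes a rename table made of per-register valid bits and ROB indices, with checkpoints keyed by the PC of the instruction that may need recovery; the class name `CheckpointedRenameTable` and its methods are hypothetical.

```python
class CheckpointedRenameTable:
    """Hypothetical rename table supporting both recovery styles."""

    def __init__(self, num_regs):
        self.valid = [False] * num_regs
        self.rob_idx = [None] * num_regs
        self.checkpoints = {}  # PC -> saved copy of (valid, rob_idx)

    def define(self, reg, rob_idx):
        self.valid[reg] = True
        self.rob_idx[reg] = rob_idx

    def flush_all(self):
        # Simple/slow: only safe once the faulting instruction is at
        # retirement, when the ARF holds every correct value.
        self.valid = [False] * len(self.valid)

    def checkpoint(self, pc):
        # Complex/fast, step 1: save the ENTIRE table.
        self.checkpoints[pc] = (list(self.valid), list(self.rob_idx))

    def recover(self, pc):
        # Complex/fast, step 2: restore the table saved for this PC.
        valid, rob_idx = self.checkpoints.pop(pc)
        self.valid, self.rob_idx = list(valid), list(rob_idx)


rt = CheckpointedRenameTable(num_regs=8)
rt.define(1, rob_idx=10)   # older, non-speculative definition
rt.checkpoint(pc=0x40)     # a branch at PC 0x40 is predicted
rt.define(2, rob_idx=11)   # speculative definition past the branch
rt.recover(pc=0x40)        # misprediction detected: restore the mappings
```

After `recover`, the mapping for register 1 survives and the speculative mapping for register 2 is gone. Calling `flush_all` instead would discard both, which is why it has to wait for retirement. The sketch leaves out discarding younger checkpoints, which a real design would also need.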

Discussion Points
What are the trade-offs between rename table flush recovery and checkpointing?
What if another instruction (being renamed) needs to access a physical storage entry after it has been overwritten?
Can I rename memory?

Reorder Buffer
(Figure: pipeline diagram with IF, ID, REN, alloc, EX, MEM, ROB, and CT; in-order and any-order regions; ROB Head and Tail pointers; ARF. Each ROB entry holds PC, Dst regID, Dst value, Except?)
Alloc
–Allocate result storage at Tail
Execute
–Get inputs (ROB T-to-H then ARF)
–Wait until all inputs ready
–Execute
WB
–Write results/fault to ROB
–Indicate result is ready
CT
–Wait until Head is done
–If fault, initiate handler
–Else, write results to ARF
–Deallocate entry from ROB
Reorder Buffer (ROB)
–Circular queue of speculative state
–May contain multiple definitions of same register

Dynamic Instruction Scheduling
(Figure: pipeline diagram with IF, ID, REN, alloc, REG, RS, EX, MEM, WB, ROB, and CT; in-order and any-order regions; ARF. RS entry fields: phyID, V, Value (per source operand), dstID, Op.)
Alloc
–Allocate ROB storage at Tail
–Allocate RS for instruction
REG
–Get inputs from ROB/ARF entry specified by REN
–Write instruction with available operands into assigned RS
WB
–Write result into ROB entry
–Broadcast result into RS with phyID of dest register
–Deallocate RS entry (requires maintenance of an RS free map)
Reservation Stations (RS)
–Associative storage indexed by phyID of dest, returns insts ready to execute
–phyID is ROB index of inst that will compute operand (used to match on broadcast)
–Value contains actual operand
–Valid bits set when operand is available (after broadcast)
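The reservation-station behavior listed above (allocate, wake up on broadcast, select, deallocate) can be sketched as below. The names `RSEntry` and `ReservationStations` are hypothetical, a `None` slot stands in for the RS free map, and picking the lowest-numbered ready slot is an assumed selection policy that the slides do not specify.

```python
class RSEntry:
    """One reservation station entry with per-source tag, value, and valid bit."""

    def __init__(self, op, dst_id, srcs):
        # srcs: list of (phy_id, value); value None means "not yet available".
        self.op = op
        self.dst_id = dst_id
        self.tags = [tag for tag, _ in srcs]
        self.vals = [val for _, val in srcs]
        self.valid = [val is not None for _, val in srcs]

    def ready(self):
        return all(self.valid)


class ReservationStations:
    def __init__(self, size):
        self.slots = [None] * size  # None marks a free slot (the free map)

    def alloc(self, entry):
        """Place an entry in a free slot; return its index, or None to stall."""
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = entry
                return i
        return None

    def broadcast(self, phy_id, value):
        """Wakeup: match the broadcast tag against every waiting source."""
        for entry in self.slots:
            if entry is None:
                continue
            for k, tag in enumerate(entry.tags):
                if not entry.valid[k] and tag == phy_id:
                    entry.vals[k] = value
                    entry.valid[k] = True

    def select(self):
        """Pick one ready entry (lowest slot first) and free its slot."""
        for i, entry in enumerate(self.slots):
            if entry is not None and entry.ready():
                self.slots[i] = None
                return entry
        return None


rs = ReservationStations(size=4)
# One operand waits on the producer tagged 5; the other was read as the value 3.
rs.alloc(RSEntry("add", dst_id=7, srcs=[(5, None), (None, 3)]))
nothing_ready = rs.select()  # no entry is ready before the broadcast
rs.broadcast(phy_id=5, value=10)
issued = rs.select()         # the add now has both operands
```

The entry becomes selectable only after the broadcast sets its last valid bit, and selecting it returns the slot to the free pool.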

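Wakeup-Select-Execute Loop
(Figure: RS entries, each holding src 1, val 1, src 2, val 2, and dstID. The broadcast dstID and result are compared (==) against the source tags of every entry; entries send req to the Selection Logic and receive grant; granted instructions go to EX/MEM, and WB feeds the broadcast.)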

Window Size vs. Clock Speed
Increasing the number of RS [Brainiac]
–Longer broadcast paths
–Thus more capacitance, and slower signal propagation
–But, more ILP extracted
Decreasing the number of RS [Speed Demon]
–Shorter broadcast paths
–Thus less capacitance, and faster signal propagation
–But, less ILP extracted
Which approach is better and when?

Cross-cutting Issue: Misspeculation
What are the impacts of misspeculation or exceptions?
–When instructions are flushed from the pipeline, their RS entries must be reclaimed
–Otherwise, storage leaks in the microarchitecture
  This can happen: Alpha reportedly flushes the instruction window to reclaim all RS resources every million or so cycles
  The PIII processor reportedly contains a livelock/deadlock detector that would recover from this failure scenario
Typical recovery approach
–Checkpoint free map at potential fault/misspeculation points
–Recover the RS free map associated with recovery PC

Optimizing the Scheduler
Optimizing Wakeup
–Value-less reservation stations
  Remove register values from latency-critical RS structures
–Pipelined schedulers
  Transform wakeup-select-execute loop to wakeup-execute loop
–Clustered instruction windows
  Allow some RS to be “close” and others “far away”, for a clock boost
Optimizing Selection
–Reservation station banking
  Associate RS groups with a FU, which reduces the complexity of picking

Value-less Reservation Stations
Q: Do we need to know the value of a register to schedule its dependent operations?
–A: No, we simply need dependencies and latencies
Value-less RS only contains required info
–Dependencies specified by physical register IDs
–Latency specified by opcode
Access register file in a later stage, after selection
Reduces size of RS, which improves broadcast speed
(Figure: pipeline diagram with IF, ID, REN, alloc, REG, RS, EX, MEM, WB, ROB, and CT; in-order and any-order regions; ARF. RS entry fields: phyID, V (per source operand), dstID, Op.)

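Value-less Reservation Stations
(Figure: RS entries, each holding only src 1, src 2, and dstID. The broadcast dstID is compared (==) against the source tags of every entry; entries send req to the Selection Logic and receive grant; granted instructions go to EX/MEM.)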

Pipelined Schedulers
Q: Do we need to know the result of an instruction to schedule its dependent operations?
–A: Once again, no, we need to know only dependencies and latency
To decouple wakeup-select loop
–Broadcast dstID back into scheduler N cycles after inst enters REG, where N is the latency of the instruction
What if latency of operation is non-deterministic?
–E.g., load instructions (2 cycle hit, 8 cycle miss)
–Wait until latency known before scheduling dependencies (SLOW)
–Predict latency, reschedule if incorrect
  Reschedule all vs. selective
(Figure: pipeline diagram with IF, ID, REN, alloc, REG, RS, EX, MEM, WB, ROB, and CT; in-order and any-order regions; ARF. RS entry fields: phyID, V (per source operand), dstID, Op.)
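The idea of broadcasting dstID N cycles after issue, rather than waiting for the result, can be sketched with a toy cycle-by-cycle scheduler. The sketch makes several assumptions that are not in the slides: the `LATENCY` table is invented, issue width is unlimited, every destination tag is unique, and all latencies are deterministic, so load-latency prediction and rescheduling are not modeled.

```python
LATENCY = {"add": 1, "mul": 3}  # hypothetical opcode latencies, in cycles


def schedule(insts):
    """insts: list of (name, op, dst_tag, src_tags) in program order.

    Returns {name: issue_cycle}. A dependent may issue in the cycle its
    producer's dstID is broadcast, which is issue_cycle + latency.
    """
    producers = {dst for _, _, dst, _ in insts}
    wake_at = {}  # dst tag -> cycle when the tag is broadcast
    issue = {}
    pending = list(insts)
    cycle = 0
    while pending:
        for inst in list(pending):
            name, op, dst, srcs = inst
            # A source is ready if nothing in flight produces it, or if
            # its producer's tag has already been broadcast.
            if all(s not in producers or wake_at.get(s, float("inf")) <= cycle
                   for s in srcs):
                issue[name] = cycle
                wake_at[dst] = cycle + LATENCY[op]  # the timer-driven broadcast
                pending.remove(inst)
        cycle += 1
    return issue


insts = [
    ("i1", "mul", "p1", ["p0"]),
    ("i2", "add", "p2", ["p1"]),        # depends on the 3-cycle mul
    ("i3", "add", "p3", ["p0"]),
    ("i4", "add", "p4", ["p2", "p3"]),  # depends on both adds
]
issue_cycles = schedule(insts)
```

With these inputs, `i2` issues at cycle 3, exactly the mul's latency after `i1`, and `i4` issues at cycle 4, back to back with `i2`. No result value is consulted anywhere in the loop; only tags and latencies are.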

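Pipelined Schedulers
(Figure: RS entries, each holding src 1, src 2, and dstID. The broadcast dstID, delayed by a timer, is compared (==) against the source tags of every entry; entries send req to the Selection Logic and receive grant; granted instructions go to EX/MEM.)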

Clustered Instruction Windows
Split instruction window into execution clusters
–W/N RS per cluster, where W is the window size and N is the number of clusters
–Faster broadcast into split windows
–Inter-cluster broadcasts take at least one more cycle
Instruction steering
–Minimizes inter-cluster transfers
–Integer/Floating point split
–Integer/Address split
–Dependence-based steering
(Figure: I-steer feeding several clusters, each with a single-cycle broadcast, connected by a single-cycle inter-cluster broadcast.)

Reservation Station Banking
Split instruction window into banks
–Group of RS associated with FU
–Faster selection within bank
Instruction steering
–Direct instructions to bank associated with instruction opcode
Trade-offs with banking
–Fewer selection candidates speeds selection logic, which is O(log W)
–But, splits RS resources by FU, increasing the risk of running out of RS resources in ALLOC stage
(Figure: a unified RS pool with one Selection Logic block, compared with I-steer feeding an RS bank for FU #1 and an RS bank for FU #2, each with its own Selection Logic.)
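The resource-splitting risk in the trade-off above can be made concrete with a small sketch. The opcode-to-FU map `FU_OF`, the bank sizes, and both function names are hypothetical, and the model ignores entries being freed by issue: it only counts how many instructions ALLOC can accept, in order, before it stalls.

```python
# Hypothetical steering map from opcode to functional-unit bank.
FU_OF = {"add": "alu", "sub": "alu", "load": "mem", "store": "mem"}


def alloc_banked(ops, bank_size):
    """Count instructions accepted, in order, before a bank fills up."""
    used = {}
    accepted = 0
    for op in ops:
        fu = FU_OF[op]
        if used.get(fu, 0) == bank_size:
            break  # ALLOC stalls even if other banks still have room
        used[fu] = used.get(fu, 0) + 1
        accepted += 1
    return accepted


def alloc_unified(ops, pool_size):
    """A unified pool accepts any mix of instructions until it is full."""
    return min(len(ops), pool_size)


ops = ["add", "sub", "add", "load"]
banked = alloc_banked(ops, bank_size=2)    # two banks of 2 entries each
unified = alloc_unified(ops, pool_size=4)  # the same 4 entries, unified
```

Both configurations hold four entries in total, yet the banked one stalls on the third ALU instruction while the memory bank sits empty. That is the cost paid for the smaller, faster selection logic per bank.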

Discussion Points
If we didn’t rename the registers, would the dynamic scheduler still work?
We can deallocate RS entries out-of-order (which improves RS utilization), why not allocate them out-of-order as well?
What about memory dependencies?

Memory dependence issues in an out-of-order pipeline
Out-of-order memory scheduling
–Dependencies are known only after address calculation
–This is handled in the memory order buffer (MOB)
When can memory operations be performed out-of-order?
–What does the MOB have to do to ensure that?
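One conservative answer to the question above can be sketched as a check a load performs against the older stores still in the buffer. This is an illustrative policy, not a description of the Pentium Pro MOB: the function name `load_action` is hypothetical, and all accesses are assumed to be the same size and aligned, so address equality is the only overlap test.

```python
def load_action(load_addr, older_stores):
    """Decide what a load may do, given older stores that have not yet committed.

    older_stores: list of (addr, data), oldest first; addr or data is None
    while the store's address or data has not been computed.
    Returns ('memory', None), ('forward', data), or ('wait', None).
    """
    # Scan from the youngest older store back to the oldest.
    for addr, data in reversed(older_stores):
        if addr is None:
            return ("wait", None)  # unknown address: it might alias the load
        if addr == load_addr:
            # The youngest matching store supplies the value, if it has it.
            return ("forward", data) if data is not None else ("wait", None)
    return ("memory", None)        # no older store conflicts: go out of order


no_conflict = load_action(0x10, [(0x20, 1), (0x30, 2)])
forwarded = load_action(0x10, [(0x10, 5), (0x20, 1)])
blocked = load_action(0x10, [(0x10, 9), (None, None)])
shadowed = load_action(0x10, [(None, None), (0x10, 9)])
```

In the last case the unresolved store is older than a matching store, so it cannot affect the value the load sees and the load still forwards. A more aggressive design would predict that unknown addresses do not alias and recover when the prediction is wrong.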

Effects of Speculation in an out-of-order pipeline
What happens when a branch mispredicts?
–When should this be recognized?
–What needs to be cleaned up?
(Figure: pipeline diagram with ID, REN, alloc, REG, RS, EX, MEM, WB, ROB, and CT; in-order region; ARF.)

Structures that must be updated after a branch misprediction
ROB
–Set tail to head to delete everything
Rename table
–Mark all entries as invalid (correct values are in the ARF)
Reservation stations
–Free all reservation station entries
MOB
–Free all MOB entries
–Correctly handle any outstanding memory operations
(Figure: ROB with Head and Tail pointers; Rename Table with regID, robIDX, and valid (v) columns.)