Out-of-Order Speculative Execution Designing a Configurable Simulator for an OOO Microprocessor By Mustafa Imran Ali ID# 230203.

COE 501 Presentation by Mustafa Imran Ali

Presentation Outline
- Introduction
- Examples: Representative Micro-architectures
- Some Issues: Limitations and Other Approaches
- Simulator Details

Out-of-order Speculative Execution: Maximizing ILP
- In-order execution
  - Pipelining: exploits temporal parallelism through overlap
  - Superscalar: more parallelism by allowing multiple instructions to issue per cycle
- Problem: pipeline stalls
  - Data dependencies limit the available ILP
  - Long-latency operations cause structural hazards
  - Data loads: cache-miss stalls

Out-of-order Speculative Execution
- Instructions execute as soon as possible and in parallel with other nondependent work
  - Results in faster execution because critical-path computations start and complete quickly
- The processor speculatively fetches and executes instructions even though it may not know immediately whether they will be on the final execution path
  - Multilevel branch prediction avoids waiting for the outcome of multiple branches

OOO Speculative Execution: Benefits
- Reduced reliance on compilers
  - Compilers cannot examine runtime dependencies
- No need for recompilation
  - Source code access is not always possible
  - Binary compatibility with existing code

OOO Speculative Execution: Problems and Issues
- Overcoming WAW and WAR hazards: register renaming
- More branches per cycle: requires accurate branch prediction
- Register renaming: needs a dependency-checking mechanism (a large number of comparisons)
- Data forwarding from producers to consumers: use of tagging and a broadcast mechanism
- Exceptions: committing instructions in program order

Compaq Alpha 21264 (1998)
- OOO superscalar with speculative execution
  - Fetches 4 instructions/cycle
  - Dynamically issues up to 6 instructions/cycle: 4 integer and 2 floating point
  - Can speculate through up to 20 branches
  - 64 architectural registers
  - 41 integer + 41 floating-point rename registers
  - Up to 80 instructions in flight + 32 in-flight loads + 32 in-flight stores
  - 20-entry integer queue
    - Issues 4 instructions
  - 15-entry floating-point queue
    - Issues 2 instructions
  - Can retire at most 11 instructions/cycle; can sustain a rate of 8/cycle (over short periods)

Stages in Instruction Pipeline
- Provides 4 instructions/cycle
- Maps virtual registers to physical registers
- Dynamically selects up to 6 instructions per cycle; issue reordering takes place here
- All pipeline stages subsequent to the register map stage operate on internal registers rather than user-visible registers

Register Renaming Process
- Assigns a unique storage location to each write-reference to a register
- Speculatively allocates a register to each instruction that produces a register result
- The register only becomes part of the user-visible (architectural) register state when the instruction retires/commits
- Allows an instruction to speculatively issue and deposit its result into the register file before the instruction retires

Register Renaming Process (continued)
- The processor maintains storage with each internal register indicating the user-visible register that is currently associated with it (if any)
- Register renaming is a content-addressable memory (CAM) operation for the register sources, together with a register allocation for the destination register
- The register mapper stores the register map state for each in-flight instruction so that the machine architectural state can be restored if a misspeculation occurs
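The renaming steps on the two slides above can be sketched as follows. This is a minimal illustration, not the 21264's actual logic: it uses an array-indexed map table instead of the CAM the slide describes, and the class name `RenameMap`, its methods, and the register counts are all assumptions made for the example.

```python
from collections import deque


class RenameMap:
    """Toy register renamer (hypothetical): architectural -> physical mapping."""

    def __init__(self, num_arch, num_phys):
        assert num_phys > num_arch
        # Initially architectural register i lives in physical register i.
        self.map = list(range(num_arch))
        self.free = deque(range(num_arch, num_phys))
        self.checkpoints = {}  # instruction id -> map state before that instruction

    def rename(self, inst_id, dest, srcs):
        """Return (physical sources, new physical dest, old physical dest)."""
        if not self.free:
            raise RuntimeError("no free physical register: the map stage stalls")
        self.checkpoints[inst_id] = list(self.map)   # saved for misspeculation recovery
        phys_srcs = [self.map[s] for s in srcs]      # look up sources first
        new_phys = self.free.popleft()               # unique location for this write
        old_phys = self.map[dest]                    # stale register, freed at retire
        self.map[dest] = new_phys
        return phys_srcs, new_phys, old_phys

    def retire(self, inst_id, old_phys):
        """After retirement the stale destination register can be reused."""
        self.checkpoints.pop(inst_id, None)
        self.free.append(old_phys)

    def squash_from(self, inst_id, allocated):
        """Restore the map saved before inst_id and free squashed allocations."""
        self.map = self.checkpoints[inst_id]
        self.checkpoints = {k: v for k, v in self.checkpoints.items() if k < inst_id}
        self.free.extend(allocated)


rm = RenameMap(num_arch=4, num_phys=8)
# r1 = r2 + r3, then r1 = r1 + r2: the two writes to r1 get distinct registers.
srcs0, dest0, old0 = rm.rename(0, dest=1, srcs=[2, 3])
srcs1, dest1, old1 = rm.rename(1, dest=1, srcs=[1, 2])
```

The second instruction reads r1 from the register allocated to the first one, so the WAW hazard on r1 disappears while the true dependence is preserved.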

Map (register rename) and Queue Stages
- The map stage renames programmer-visible register numbers to internal register numbers
- The queue stage stores instructions until they are ready to issue
- Structures are duplicated for integer and floating-point execution

Out-of-order Issue Queues
- Issue queue logic maintains 2 lists of pending instructions in separate integer and floating-point queues
- Scoreboards maintain the status of the internal registers by tracking the progress of single-cycle, multiple-cycle, and variable-cycle (memory load) instructions
- The scoreboard unit notifies all instructions in the queue that require the register value when functional-unit or load-data results become available

Out-of-order Execution
- Each queue/arbiter selects the oldest operand-ready and functional-unit-ready instructions for execution each cycle
- Queues are collapsible: an entry becomes immediately available once the instruction issues or is squashed due to misspeculation
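The oldest-first selection described above might look like this in a simulator. It is only a sketch under assumed names (`QueueEntry`, `select_oldest_ready`, and the functional-unit labels are hypothetical), and it ignores per-cluster arbitration details of the real machine.

```python
from dataclasses import dataclass


@dataclass
class QueueEntry:
    age: int               # smaller value = older instruction (fetch order)
    operands_ready: bool   # all source operands are available
    fu_type: str           # hypothetical functional-unit class, e.g. "alu"


def select_oldest_ready(queue, free_units, width):
    """Pick up to `width` oldest entries that are operand-ready and FU-ready."""
    free = dict(free_units)  # fu_type -> units still free this cycle
    issued = []
    for entry in sorted(queue, key=lambda e: e.age):
        if len(issued) == width:
            break
        if entry.operands_ready and free.get(entry.fu_type, 0) > 0:
            free[entry.fu_type] -= 1
            issued.append(entry)
    # Collapse the queue: issued entries free their slots immediately.
    issued_ids = {id(e) for e in issued}
    remaining = [e for e in queue if id(e) not in issued_ids]
    return issued, remaining


queue = [
    QueueEntry(3, True, "alu"),
    QueueEntry(1, False, "alu"),   # oldest, but still waiting for an operand
    QueueEntry(2, True, "alu"),
    QueueEntry(4, True, "mul"),
]
issued, remaining = select_oldest_ready(queue, {"alu": 1, "mul": 1}, width=4)
```

With one free ALU, the entry of age 2 wins over age 3 because it is older, and the not-yet-ready oldest entry is skipped rather than blocking the queue.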

Retire Mechanism
- Assigns each mapped instruction a slot in a circular in-flight window (in fetch order)
- Tracks the internal register usage for all in-flight instructions
- Each entry in the mechanism contains storage indicating the internal register that held the old contents of the destination register for the corresponding instruction
- This (stale) register can be freed for other use after the instruction retires

Exception Handling
- An exception causes all younger instructions in the in-flight window to be squashed and removed from all queues in the system
- The register map is backed up to the state before the last squashed instruction using the saved map state
- Registers allocated by the squashed instructions become immediately available

HP PA-RISC 8000

ROB Size Performance Effect

AMD K-5 ROB Entry

AMD K-5 Reservation Station Entry

Approaches for Billion Transistor Architectures
- Advanced superscalar processors
  - Scale up from current designs to issue 16 or 32 instructions per cycle
- Superspeculative processors
  - Enhance wide-issue superscalar performance by speculating aggressively at every point in the processor pipeline

SPARC64 V9

Pentium III and 4 Register Renaming and ROB

One Billion Transistors, One Uniprocessor, One Chip?

Superspeculative Architecture

Area Issues
- Large circuitry is required to feed the processor with a continuous instruction stream
- Dynamic execution requires a large number of comparisons for dependency checking
- The sizes of the reorder buffer and the reservation stations/rename registers increase accordingly

Limitations
- Larger-issue machines have high peak-to-sustained rate ratios (Intel Pentium Pro architecture approach)
- Beyond issue widths of 8, the inherently limited ILP of a single thread gives diminishing returns
  - More architectures are switching to Simultaneous Multithreading

Alternate Approaches

Approach | Issue Structure | Hazard Detection | Scheduling | Comment | Examples
Speculative Superscalar | Dynamic | Hardware | Dynamic with speculation | OOO with speculation | Pentium II/III/IV, Alpha 21264
VLIW | Static | Software | Static | No hazard between issue packets | MAJC
EPIC | Mostly static | Mostly software | Mostly static | Explicit dependences marked by compiler | Itanium

OOO Speculative Execution Processor: Simulator Design
- Tracking all the activities of the pipelined machine in each clock cycle
- Issue unit design that resolves structural and data hazards
- Dependency-checking mechanisms
- Strategy for sending data from producers to consumers

Data Structures
- Instruction Queue
- Execution Tracking Hardware Structure
  - Register File Producer Table
  - Reservation Stations
  - The Reorder Buffer
- Functional Units State Structure

Service Functions
- Issue
- Dispatch
- Completion
- CDB Snooping
- Retirement and Writeback

Overall Structure

Producer Table
- Each register is extended by a tag and a valid flag
  - Valid = true iff the register contains the appropriate data
  - Otherwise, the tag points to the instruction producing the data

Reservation Stations
- The full bit is set if the entry is occupied
- Tag points to the ROB tag of the instruction
- op1 and op2 hold the source references

The Reorder Buffer
- Realized as a FIFO with ROBhead and ROBtail
- New instructions are put at ROBtail, and the instruction's RS entry is tagged with this ROB index
- Each cycle, the entry at ROBhead is checked for instruction completion (its valid bit)
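A circular-buffer version of this FIFO could be sketched as below. The class and method names are assumptions for illustration, and the occupancy counter is one possible way to tell a full buffer from an empty one; the slides do not specify how that is done.

```python
class ReorderBuffer:
    """Toy ROB (hypothetical layout): circular FIFO with head and tail indices."""

    def __init__(self, size):
        self.entries = [None] * size
        self.head = 0    # oldest in-flight instruction
        self.tail = 0    # slot for the next issued instruction
        self.count = 0   # occupancy, distinguishes full from empty

    def is_full(self):
        return self.count == len(self.entries)

    def is_empty(self):
        return self.count == 0

    def allocate(self, dest):
        """Put a new instruction at the tail; the returned index is its tag."""
        if self.is_full():
            raise RuntimeError("ROB full: issue must stall")
        tag = self.tail
        self.entries[tag] = {"dest": dest, "valid": False, "data": None}
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return tag

    def complete(self, tag, data):
        """Record a result; the entry becomes eligible for retirement."""
        self.entries[tag]["valid"] = True
        self.entries[tag]["data"] = data

    def retire(self):
        """Pop the head entry if it has completed, otherwise return None."""
        if self.is_empty() or not self.entries[self.head]["valid"]:
            return None
        entry = self.entries[self.head]
        self.entries[self.head] = None
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return entry


rob = ReorderBuffer(2)
t0 = rob.allocate("r1")
t1 = rob.allocate("r2")
rob.complete(t1, 7)          # the younger instruction finishes first
blocked = rob.retire()       # head has not completed, so nothing retires
```

Even though the younger instruction completes first, nothing retires until the head entry is valid, which is what keeps commitment in program order.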

Issue Protocol

    if (there is a free RS and a free ROB entry) {
        RS.full := 1;
        RS.tag := ROBtail;
        for all operands x of Ii with address r
            if Rr.valid = 1
                RS.opx := Rr;
            else if CDB.tag = Rr.tag and CDB.valid
                RS.opx := CDB;
            else
                RS.opx := ROB[Rr.tag];
        if (Ii has a destination register r) {
            Rr.tag := ROBtail;
            Rr.valid := 0;
            ROB[ROBtail].dest := r;
        } else
            ROB[ROBtail].dest := none;
        ROBtail := ROBtail + 1;
    }
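The issue protocol can be turned into a runnable sketch roughly as follows. The data layout is an assumption made for the example: registers and RS operands share a hypothetical `Operand` record, free RS and ROB slots are represented by `None`, and the tail index wraps around modulo the ROB size.

```python
from dataclasses import dataclass


@dataclass
class Operand:
    valid: bool = False
    data: int = 0
    tag: int = -1   # ROB index of the producer while valid is False


def issue(inst, regs, rob, rob_tail, rs_list, cdb):
    """One issue step; returns the new ROB tail, or None to signal a stall.

    inst:    {"srcs": [register numbers], "dest": register number or None}
    regs:    list of Operand (register file extended by the producer table)
    rob:     list of entry dicts, None marks a free slot
    rs_list: list of station dicts, None marks a free station
    cdb:     {"valid": bool, "tag": int, "data": int}
    """
    free_rs = next((i for i, rs in enumerate(rs_list) if rs is None), None)
    if free_rs is None or rob[rob_tail] is not None:
        return None  # no free RS or no free ROB entry
    ops = []
    for r in inst["srcs"]:  # sources are read before the destination is re-tagged
        reg = regs[r]
        if reg.valid:
            ops.append(Operand(True, reg.data))
        elif cdb["valid"] and cdb["tag"] == reg.tag:
            ops.append(Operand(True, cdb["data"]))
        else:
            producer = rob[reg.tag]  # may already hold the completed result
            ops.append(Operand(producer["valid"], producer["data"], reg.tag))
    rs_list[free_rs] = {"tag": rob_tail, "ops": ops}
    rob[rob_tail] = {"dest": inst["dest"], "valid": False, "data": 0}
    if inst["dest"] is not None:
        regs[inst["dest"]].tag = rob_tail
        regs[inst["dest"]].valid = False
    return (rob_tail + 1) % len(rob)


regs = [Operand(True, 10), Operand(True, 20), Operand(True, 0)]
rob, rs_list = [None] * 4, [None] * 2
cdb = {"valid": False, "tag": -1, "data": 0}
tail = issue({"srcs": [0, 1], "dest": 2}, regs, rob, 0, rs_list, cdb)     # r2 = r0 op r1
tail = issue({"srcs": [2, 0], "dest": 2}, regs, rob, tail, rs_list, cdb)  # r2 = r2 op r0
```

The second instruction picks up tag 0 for its first operand, so it waits on the first instruction's result rather than on the stale register value.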

Dispatch Protocol

    if there is a RS with RS.opx.valid = 1 for all operands x
       and the function unit is not stalled {
        Pass instruction, operands, and tag to FU
        RS.full := 0;
    }

Completion Protocol

    if FU has result and got CDBacknowledge {
        CDB.valid := 1;
        CDB.data := result from FU;
        CDB.tag := tag from FU;
        ROB[CDB.tag].valid := 1;
        ROB[CDB.tag].data := CDB.data;
    }

CDB Snooping

    For all operands x:
        if RS.full = 1 and RS.opx.valid = 0 and RS.opx.tag = CDB.tag {
            RS.opx := CDB;
        }
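Completion and CDB snooping work as a pair, which the following sketch tries to show. The dictionary layout and the function names `complete`, `snoop`, and `ready` are assumptions for illustration; a free reservation station (full bit clear) is represented by `None`, and CDB arbitration (the acknowledge signal) is not modeled.

```python
def complete(result, tag, cdb, rob):
    """Completion: drive an FU result onto the CDB and record it in the ROB."""
    cdb.update(valid=True, data=result, tag=tag)
    rob[tag]["valid"] = True
    rob[tag]["data"] = result


def snoop(rs_list, cdb):
    """CDB snooping: waiting operands with a matching tag capture the broadcast."""
    if not cdb["valid"]:
        return
    for rs in rs_list:
        if rs is None:  # empty station, nothing to update
            continue
        for op in rs["ops"]:
            if not op["valid"] and op["tag"] == cdb["tag"]:
                op["valid"] = True
                op["data"] = cdb["data"]


def ready(rs):
    """Dispatch condition: the station is occupied and all operands are valid."""
    return rs is not None and all(op["valid"] for op in rs["ops"])


rob = [{"dest": 2, "valid": False, "data": None}]
rs_list = [
    {"tag": 1, "ops": [{"valid": False, "tag": 0, "data": None},
                       {"valid": True, "tag": -1, "data": 3}]},
    None,
]
cdb = {"valid": False, "tag": -1, "data": None}
was_ready = ready(rs_list[0])   # still waiting on ROB tag 0
complete(42, 0, cdb, rob)       # the producer finishes and broadcasts
snoop(rs_list, cdb)
```

After the broadcast, the waiting station holds both operands and satisfies the dispatch condition on the next cycle.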

Retirement/Writeback Protocol

    if ROB not empty and ROB[ROBhead].valid = 1 {
        if instruction in ROB[ROBhead] requires writeback {
            x := ROB[ROBhead].dest;
            Rx.data := ROB[ROBhead].data;
            if ROBhead = Rx.tag
                Rx.valid := 1;
        }
        ROBhead := ROBhead + 1;
    }

Configurable Parameters
- Probability of memory misses
- Probability of correct branch prediction
- Branch misprediction penalty
- Cache miss penalty
- Window size for instruction issue
- Number of issues/cycle
- Number of functional units (FUs)
- Pipeline depth/latency of each FU
- Number of CDBs
- Size of reservation stations/rename registers (RS)
- Operand matching mechanism in each RS
- Size of reorder buffer
- Branch prediction mechanisms (optional)
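One way to carry these parameters through a simulator is a single configuration record, sketched below. Every field name and default value here is a placeholder chosen for illustration, not a figure from the presentation; the operand-matching and branch-prediction mechanisms are left out because the slides do not specify their options.

```python
from dataclasses import dataclass, field


@dataclass
class SimConfig:
    """Hypothetical configuration record; all defaults are placeholders."""
    cache_miss_prob: float = 0.05
    branch_predict_accuracy: float = 0.90
    branch_mispredict_penalty: int = 7     # cycles
    cache_miss_penalty: int = 20           # cycles
    issue_window_size: int = 16
    issue_width: int = 4                   # issues per cycle
    fu_count: dict = field(default_factory=lambda: {"alu": 2, "mul": 1, "load": 1})
    fu_latency: dict = field(default_factory=lambda: {"alu": 1, "mul": 4, "load": 2})
    num_cdbs: int = 2
    num_reservation_stations: int = 8
    rob_size: int = 32

    def problems(self):
        """Return a list of human-readable configuration errors (empty if none)."""
        issues = []
        for name in ("cache_miss_prob", "branch_predict_accuracy"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                issues.append(f"{name} must be a probability in [0, 1]")
        for name in ("issue_window_size", "issue_width", "num_cdbs",
                     "num_reservation_stations", "rob_size"):
            if getattr(self, name) < 1:
                issues.append(f"{name} must be at least 1")
        if set(self.fu_count) != set(self.fu_latency):
            issues.append("fu_count and fu_latency must name the same units")
        return issues


default_problems = SimConfig().problems()
bad_problems = SimConfig(issue_width=0, cache_miss_prob=1.5).problems()
```

Checking the configuration before a run keeps invalid parameter combinations from showing up later as confusing simulation results.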

Performance Metrics
- Number of clock cycles on an instruction trace
- Number of stalls (various types)
- Effect on hardware costs
- Peak vs. sustained rates (actual issues vs. maximum possible)
- Percentage resource utilization
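A few of these metrics reduce to simple ratios over counters the simulator would collect. The sketch below assumes hypothetical counter names and treats the sustained rate as instructions issued per cycle over the whole trace; the input numbers in the usage lines are made up for illustration.

```python
def summarize(cycles, instructions_issued, issue_width, fu_busy_cycles):
    """Derive rate and utilization metrics from raw simulation counters.

    fu_busy_cycles: hypothetical mapping of FU name -> cycles the unit was busy.
    """
    sustained = instructions_issued / cycles          # actual issues per cycle
    return {
        "sustained_rate": sustained,
        "peak_rate": issue_width,                     # maximum possible issues per cycle
        "issue_utilization": sustained / issue_width,
        "fu_utilization": {name: busy / cycles for name, busy in fu_busy_cycles.items()},
    }


stats = summarize(cycles=100, instructions_issued=250, issue_width=4,
                  fu_busy_cycles={"alu": 80, "mul": 25})
```

With these illustrative counters, the machine sustains 2.5 issues per cycle against a peak of 4.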

OOO Speculative Micro-architecture Simulators
- SimpleScalar
  - University of Wisconsin in Madison
  - www.simplescalar.com
- KScalar
  - Universidad Autónoma de Barcelona
  - www.caos.uab.es/kscalar

SimpleScalar v3.0
- The tool set includes sample simulators ranging from a fast functional simulator to a detailed, dynamically scheduled processor model that supports non-blocking caches, speculative execution, and state-of-the-art branch prediction
- Includes performance visualization tools, statistical analysis resources, and debug and verification infrastructure
- Includes a machine definition infrastructure that permits most architectural details to be separated from simulator implementations

KScalar
- Allows analyzing the performance behavior of a wide range of processor microarchitectures: from a very simple in-order, scalar pipeline to a detailed out-of-order, superscalar pipeline with non-blocking caches, speculative execution, and complex branch prediction
- The simulator interprets executables for the Alpha AXP instruction set: from very short program fragments to large applications
- The object program's execution may be simulated in varying levels of detail: either cycle by cycle, observing all the pipeline events that determine processor performance, or millions of cycles at once, taking statistics of the main performance issues

Study Direction
- Modeling and comparison of representative micro-architectures
  - Parameters modeling the OOO speculative execution cores of commercial micro-architectures
  - SPEC benchmark instruction traces
  - Analysis of the relative importance of supporting assumptions

Study Direction (continued)
- Modeling resource utilization of a simultaneous multithreaded workload
  - Comparison of resource utilization and performance metrics of single-thread vs. SMT execution
  - Use of instruction traces that model a multi-thread workload (e.g. modeling Hyperthreading in Pentium 4)
