Published by Posy Newman. Modified over 9 years ago.
1
CDA 5155 Out-of-order execution: Advanced pipelines
2
Implications of Superscalar Execution
Instruction fetch?
–Taken branches, multiple branches, partial cache lines
Instruction decode?
–Simple for fixed-length ISAs, much harder for variable length
Renaming?
–Multi-ported rename table; inter-instruction dependencies must be recognized
Dynamic scheduling?
–Requires multiple result buses, smarter selection logic
Execution?
–Multiple functional units, multiple result buses
Commit?
–Multiple ROB/ARF ports; dependencies must be recognized
3
P4 Overview
More aggressive processor
–Equipped with the full set of IA-32 SIMD operations
–First flagship microarchitecture since the PPro (P6)
–Pentium 4 ISA = Pentium III ISA + SSE2
–SSE2 (Streaming SIMD Extensions 2) provides 128-bit SIMD integer and floating-point operations + prefetch
4
Execution Pipeline
5
Front End
Predicts branches
Fetches/decodes code into the trace cache
Generates µops for complex instructions
Prefetches instructions that are likely to be executed
6
Branch Prediction
Dynamically predict the direction and target of branches based on PC using the BTB
If no dynamic prediction is available, statically predict
–Taken for backwards (looping) branches
–Not taken for forward branches
–Implemented at decode
Traces built across (predicted) taken branches to avoid taken-branch penalties
Also includes a 16-entry return address stack predictor
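The decode-stage static heuristic above ("backward taken, forward not taken") can be sketched in a few lines; this is an illustrative model of the rule, not Intel's actual logic:

```python
# Static decode-stage prediction sketch: backward branches (targets at
# lower addresses) are assumed to close loops and are predicted taken;
# forward branches are predicted not taken.

def static_predict(branch_pc: int, target_pc: int) -> bool:
    """Return True if the branch is statically predicted taken."""
    return target_pc < branch_pc  # backward branch => likely a loop

# A loop-closing branch at 0x4010 jumping back to 0x4000: predicted taken.
assert static_predict(0x4010, 0x4000) is True
# A forward branch (e.g., skipping an error handler): predicted not taken.
assert static_predict(0x4010, 0x4020) is False
```

In the P4 this static rule is only a fallback; the BTB's dynamic prediction takes priority whenever it has an entry for the branch.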
7
Decoder
Single decoder available
–Operates at a maximum of 1 instruction per cycle
Receives instructions from the L2 cache 64 bits at a time
Some complex instructions must enlist the micro-ROM
–Used for very complex IA-32 instructions (> 4 µops)
–After the microcode ROM finishes, the front end resumes fetching µops from the trace cache
8
Execution Pipeline
9
Trace Cache
Primary instruction cache in the P4 architecture
–Stores 12K decoded µops
On a miss, instructions are fetched from L2
A trace predictor connects traces
The trace cache removes
–Decode latency after mispredictions
–Decode power for all pre-decoded instructions
10
Branch Hints
P4 software can provide hints to branch prediction and the trace cache
–Specify the likely direction of a branch
–Implemented with conditional branch prefixes
–Used for decode-stage predictions and trace building
11
Execution Pipeline
13
Execution
126 µops can be in flight at once
–Up to 48 loads / 24 stores
Can dispatch up to 6 µops per cycle
–2× the trace cache and retirement µop bandwidth
–Provides additional B/W for scheduling around mispeculation
14
Execution Units
15
Store and Load Scheduling
Out-of-order store and load operations
–Stores always commit in program order
48 loads and 24 stores can be in flight
Store/load buffers are allocated at the allocation stage
–24 store buffers and 48 load buffers in total
16
Execution Pipeline
17
Retirement
Can retire 3 µops per cycle
Implements precise exceptions
Reorder buffer used to organize completed µops
Also keeps track of branches and sends updated branch information to the BTB
18
Data Stream of Pentium 4 Processor
19
On-chip Caches
L1 instruction cache (trace cache)
L1 data cache
L2 unified cache
–All caches use a pseudo-LRU replacement algorithm
[Table of cache parameters not reproduced]
20
L1 Data Cache
Non-blocking
–Supports up to 4 outstanding load misses
Load latency
–2 clocks for integer
–6 clocks for floating point
1 load and 1 store per clock
Load speculation
–Assume the access will hit the cache
–"Replay" the dependent instructions when a miss is detected
21
L2 Cache
Non-blocking
Load latency
–Net load access latency of 7 cycles
Bandwidth
–1 load and 1 store in one cycle
–New cache operations may begin every 2 cycles
–256-bit wide bus between L1 and L2
–48 GB/s @ 1.5 GHz
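The 48 GB/s figure follows directly from the bus width and clock rate; a quick check:

```python
# L1<->L2 bandwidth on the slide: a 256-bit bus delivering one transfer
# per cycle at 1.5 GHz.
bus_bits = 256
bytes_per_transfer = bus_bits // 8            # 32 bytes per cycle
clock_hz = 1.5e9
bandwidth = bytes_per_transfer * clock_hz     # bytes per second
print(bandwidth / 1e9)                        # 48.0 (GB/s)
```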
22
L2 Cache Data Prefetcher
A hardware prefetcher monitors reference patterns
Brings cache lines in automatically
Attempts to fetch 256 bytes ahead of the current access
Prefetches for up to 8 simultaneous independent streams
23
Execution on MPEG4 Benchmarks @ 1 GHz
24
Problems with even faster clocks
Power goes up (power ∝ frequency × voltage²)
–Unfortunately, frequency is tied to voltage
Hard to break some operations into pipelined pieces
–Can't break all critical paths
Pipeline register writes use part of the clock
–And as frequency grows, pipeline register writes use more of the clock period
Clock skew
–Clocks to different registers will not be perfectly aligned
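Because dynamic power scales roughly as P = C·V²·f and a higher frequency usually demands a higher supply voltage, power grows much faster than clock rate. A small numeric sketch (the capacitance, voltage, and frequency values below are made up for illustration):

```python
# Why frequency scaling is power-expensive: dynamic power ~ C * V^2 * f,
# and reaching a higher f typically requires raising V as well.

def dynamic_power(c_eff, vdd, freq):
    """Dynamic switching power, P = C_eff * Vdd^2 * f."""
    return c_eff * vdd**2 * freq

base   = dynamic_power(1.0, 1.2, 2.0e9)    # baseline operating point
# A 25% higher clock, with the ~10% voltage bump assumed needed to hit it:
faster = dynamic_power(1.0, 1.32, 2.5e9)
print(faster / base)   # ~1.51: a 25% frequency gain costs ~51% more power
```

The ratio works out to 1.1² × 1.25 ≈ 1.51, which is the quadratic voltage penalty the slide alludes to.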
25
AMD Opteron
26
AMD Hammer Microarchitecture
12-stage pipeline
Pre-decoded instruction memory
–With ID bits to identify branch instructions and the first byte of each instruction
Partitioned register file
Bigger data cache
27
64-bit Architectural Extensions
28
Classical Pipelining
Synchronous digital circuit
Partition combinational logic into stages
Insert pipeline registers between stages
[Figure: combinational stages separated by pipeline registers]
29
Classical Pipelining - Problems
For max performance, all stages must be busy all the time
–How many LC2K3 instructions do something useful in each stage?
Logic must be divided equally so all computations finish at exactly the same time
–How long does it take to complete the LC2K1 decode stage?
Very deep pipelines have a lot of overhead writing to the pipeline registers
30
Wave Pipelining
Also referred to as maximal-rate pipelining
Allows multiple data waves simultaneously between successive storage elements (registers or pipeline registers)
–So intermediate pipeline registers are not needed
Uses a clock period that is less than the max propagation delay between the registers
31
Wave Pipelining (Cont.)
Data at the input is changed before the previous data has completely propagated through to the output
Picture a water slide…
[Figure: cycle-time diagram]
32
Wave Pipelining Example
Min delay of 16, max delay of 20
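The cycle-time bound implied by this example can be checked numerically. A new wave can safely enter once it can no longer catch the slowest signals of the previous wave, i.e. every Dmax − Dmin time units (register setup/overhead ignored in this sketch):

```python
# Cycle-time bound for the example: min delay 16, max delay 20.
d_min, d_max = 16, 20

conventional_cycle = d_max             # one wave in flight at a time
wave_cycle = d_max - d_min             # a new wave may enter every 4 units
waves_in_flight = d_max // wave_cycle  # up to 5 overlapping data waves

print(conventional_cycle, wave_cycle, waves_in_flight)  # 20 4 5
```

So with perfectly balanced paths this block could clock 5× faster than a conventional single-wave design, which is why equalizing path delays matters so much (next slide).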
33
Wave Pipelining – Maximizing Clock Rate
Minimum cycle time limited by the difference between min and max input-to-output delays (and device switching speed)
For max clock rate, must equalize all path delays from input to output
Factors:
–Topological path differences
–Process/temperature/power variations
–Data-dependent delay variations
Intentional clock skew?
34
Wave Pipelining - Problems
Operating speed constrained to a narrow range of frequencies for a given degree of wave pipelining
A new fabrication process requires significant redesign
No effective mechanism for starting/stopping:
–Pipeline stalls, low-speed testing?
In general, very hard to do circuit analysis
35
Benefits of Register Communication
Directly specified dependencies (contained within the instruction)
–Accurate description of communication
No false or missing dependency edges
Permits realization of the dataflow schedule
–Early description of communication
Allows scheduler pipelining without impacting speed of communication
Small communication name space
–Fast access to communication storage
Possible to map/rename the entire communication space (no tags)
Possible to bypass communication storage
36
Why Memory Scheduling is Hard (Or, Why is it called HARDware?)
Loads/stores also have dependencies through memory
–Described by effective addresses
Cannot directly leverage existing infrastructure
–Indirectly specified memory dependencies
The dataflow schedule is a function of the program's computation, preventing accurate description of communication early in the pipeline
–A pipelined scheduler is slow to react to addresses
–Large communication space (2^32–2^64 bytes!): cannot fully map the communication space; requires a more complicated cache and/or store-forward network
[Figure: *p = … ; *q = … ; … = *p — which store does the load depend on?]
37
Requirements for a Solution
Accurate description of memory dependencies
–No (or few) missing or false dependencies
–Permit realization of the dataflow schedule
Early presentation of dependencies
–Permit pipelining of the scheduler logic
Fast access to the communication space
–Preferably as fast as register communication (zero cycles)
38
In-order Load/Store Scheduling
Schedule all loads and stores in program order
–Cannot violate true data dependencies (non-speculative)
Capabilities/limitations:
–Not accurate – may add many false dependencies
–Early presentation of dependencies (no addresses)
–Not fast – all communication through memory structures
Found in in-order issue pipelines
[Figure: true vs. realized dependencies, in program order, for the sequence st X; ld Y; st Z; ld X; ld Z]
39
In-order Load/Store Scheduling Example
[Figure: cycle-by-cycle timeline for st X; ld Y; st Z; ld X; ld Z under in-order scheduling – each access waits for all earlier accesses]
40
Blind Dependence Speculation
Schedule loads and stores when register dependencies are satisfied
–May violate true data dependencies (speculative)
Capabilities/limitations:
–Accurate – if there is little in-flight communication through memory
–Early presentation of dependencies (no dependencies!)
–Not fast – all communication through memory structures
Most common with small windows
[Figure: true vs. realized dependencies for st X; ld Y; st Z; ld X; ld Z]
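One standard way to detect blind-speculation violations (assumed here as an illustration; the slides leave detection as a discussion point) is to check each resolving store address against younger loads that have already executed. A minimal sketch:

```python
# Blind speculation with store-resolution checking: loads execute as soon
# as their register inputs are ready; when a store's address resolves,
# any *younger* load to the same address that already executed is flagged
# as mispeculated (simplified model, no sizes or partial overlaps).

def find_mispeculations(events):
    """events: ('ld'|'st', program_order_seq, addr) in *execution* order.
    Returns seq numbers of loads that ran before an older store to the
    same address had resolved."""
    done_loads = []   # (seq, addr) of loads already executed
    bad = []
    for kind, seq, addr in events:
        if kind == 'ld':
            done_loads.append((seq, addr))
        else:  # store address now resolved
            bad += [s for s, a in done_loads if s > seq and a == addr]
    return bad

# Program order: st X(0), ld Y(1), st Z(2), ld X(3), ld Z(4).
# Blind speculation lets all loads execute before the stores resolve:
execution = [('ld', 1, 'Y'), ('ld', 3, 'X'), ('ld', 4, 'Z'),
             ('st', 0, 'X'), ('st', 2, 'Z')]
print(find_mispeculations(execution))   # [3, 4]: ld X and ld Z violated
```

Recovery then typically squashes and re-executes from the offending load, which is the cost the later slides weigh against blind speculation's scheduling aggressiveness.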
41
Blind Dependence Speculation Example
[Figure: timeline for st X; ld Y; st Z; ld X; ld Z with all accesses issued as soon as register inputs are ready – the loads execute before the stores they depend on, and a mispeculation is detected]
42
Discussion Points
Suggest a way to detect blind load mispeculation
Suggest a way to recover from blind load mispeculation
43
The Case for More/Less Accurate Dependence Speculation
Small windows: blind speculation is accurate for most programs; the compiler can register-allocate most short-term communication
Large windows: blind speculation performs poorly; many memory communications are in the execution window
[Graph for 099.go, from Moshovos96]
44
Conservative Dataflow Scheduling
Schedule loads and stores when all dependencies are known satisfied
–Conservative – won't violate true dependencies (non-speculative)
Capabilities/limitations:
–Accurate only if addresses arrive early
–Late presentation of dependencies (verified with addresses)
–Not fast – all communication through memory and/or a complex store-forward network
Common for larger windows
[Figure: true vs. realized dependencies for st X; ld Y; st?Z; ld X; ld Z – the address of st Z is unknown]
45
Conservative Dataflow Scheduling Example
[Figure: timeline for st X; ld Y; st?Z; ld X; ld Z – loads stall until the unknown address of st Z resolves, adding stall cycles]
46
Discussion Points
What if no dependent store or unknown store address is found?
Describe the logic used to locate dependent store instructions
What is the tradeoff between small and large memory schedulers?
How should uncached loads/stores be handled? Video RAM?
47
Memory Dependence Speculation [Moshovos96]
Schedule loads and stores when data dependencies are satisfied
–Uses a dependence predictor to match sourcing stores to loads
–Doesn't wait for addresses; may violate true dependencies (speculative)
Capabilities/limitations:
–As accurate as the predictor
–Early presentation of dependencies (data addresses not used in prediction)
–Not fast – all communication through memory structures
[Figure: true vs. realized dependencies for st?X; ld Y; st?Z; ld X; ld Z]
48
Dependence Speculation - In a Nutshell
Assumes static placement of dependence edges is persistent
–Good assumption!
Common cases:
–Accesses to global variables
–Stack accesses
–Accesses to aliased heap data
The predictor tracks store/load PCs and reproduces the last sourcing store PC given a load PC
[Figure: for A: *p = … ; B: *q = … ; C: … = *p, the predictor maps load C to its last sourcing store, A or B]
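The "last sourcing store" predictor described above can be sketched as a PC-indexed table; this is a simplified, hypothetical model (a real predictor would be a finite, tagged hardware structure):

```python
# Dependence predictor sketch: a table indexed by load PC that remembers
# the PC of the store that last forwarded data to that load. Next time
# the load is fetched, the scheduler can order it behind that store.

class DependencePredictor:
    def __init__(self):
        self.table = {}   # load PC -> predicted sourcing store PC

    def train(self, load_pc, store_pc):
        """Record which store actually sourced this load."""
        self.table[load_pc] = store_pc

    def predict(self, load_pc):
        """Predicted sourcing store PC, or None if untrained."""
        return self.table.get(load_pc)

dp = DependencePredictor()
dp.train(0xC, 0xA)        # load at PC C last got its data from store at A
print(dp.predict(0xC))    # 10 (0xA): schedule ld C behind st A next time
print(dp.predict(0x10))   # None: unseen load, no prediction
```

This captures the slide's point: the prediction is purely PC-based, so it is available early in the pipeline, long before any effective address is computed.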
49
Memory Dependence Speculation Example
[Figure: timeline for st?X; ld Y; st?Z; ld X; ld Z – the predictor holds ld X behind its predicted sourcing store while unrelated accesses proceed]
50
Memory Renaming [Tyson/Austin97]
Design maxims:
–Registers Good, Memory Bad
–Stores/Loads Contribute Nothing to Program Results
Basic idea:
–Leverage the dependence predictor to map memory communication onto the register synchronization and communication infrastructure
Benefits:
–Accurate dependence info if the predictor is accurate
–Early presentation of dependence predictions
–Fast communication through the register infrastructure
51
Memory Renaming Example
Renamed dependence edges operate at bypass speed
Load/store address stream becomes a "checker" stream
–Need only be high-B/W (if the predictor performs well)
–Risky to remove memory accesses completely
[Figure: the sequence st X; ld Y; st Z; ld X; ld Z with its dependence edges renamed onto the register infrastructure]
52
Memory Renaming Implementation
Speculative loads require a recovery mechanism
Enhancements muddy the boundaries between dependence, address, and value prediction
–Long-lived edges reside in the rename table as addresses
–Semi-constants also promoted into the rename table
[Figure: the dependence predictor, indexed by store/load PCs, produces a predicted edge name (5–9 bit tag) that indexes the edge rename table (one entry per edge), yielding a physical storage assignment – destination for stores, source for loads]
53
Experimental Evaluation
Implemented on a SimpleScalar 2.0 baseline
Dynamic scheduling timing simulation (sim-outorder)
–256-instruction RUU
–Aggressive front end
–Typical 2-level cache memory hierarchy
Aggressive memory renaming support
–4K entries in the dependence predictor
–512 edge names, LRU allocated
Load speculation support
–Squash recovery
–Selective re-execution recovery
54
Dependence Predictor Good coverage of “in-flight” communication Lots of room for improvement
55
Program Performance
Performance predicated on:
–High-B/W fetch mechanism
–Efficient mispeculation recovery mechanism
Better speedups with:
–Larger execution windows
–Increased store-forward latency
–A confidence mechanism
56
Additional Work
Turning the crank – continue to improve the base mechanisms
–Predictors (loop-carried dependencies, better stack/global prediction)
–Improve mispeculation recovery performance
Value-oriented memory hierarchy
Data value speculation
Compiler-based renaming (tagged stores and loads):
store r1,(r2):t1
store r3,(r4):t2
load r5,(r6):t1
57
Scalable Microarchitectures
Traditional microarchitecture designs are running out of gas…
Many research opportunities
–High-B/W fetch architectures (dual-path execution, multiple-branch predictors)
–Pipeline schedulers (partitioned resources, run-ahead processing)
–Fast execution cores/memory (memory renaming, region caching)
–Power-efficient microarchitectures (ASP)
[Figure: pipeline with in-order fetch (IF, ID, REN, REG), out-of-order execute (scheduler, EX/MEM), and in-order retirement (CT), annotated with the challenges: diminishing ILP, fetch starved, slow wires, slow memory, unreliable logic, increasing power]
58
Microprocessor Verification
The fatalist's approach to microprocessor verification!
Core technology: dynamic verification (DIVA: Todd Austin)
–A simple (and correct) checker processor verifies all results before retirement
–Reduces the burden of correctness on the core processor design
–Core processor relegated to branch/value prediction and cache prefetch
Fundamentally changes the design of a complex microprocessor
–Complete formal verification feasible
–Low-cost SER protection
–Beta-release microprocessors
–Self-tuned digital circuits
[Figure: fault-tolerant core plus checker – speculative instructions flow in order, with inputs and outputs, from the core pipeline (IF, ID, REN, REG, scheduler, EX/MEM) through the checker (CHK) to commit (CT)]
59
The Burden of Verification
Immense test space
–Impossible to fully test the system
–For example, 32 regs, 8K caches, 300 pins = 2^132396 states
–A conservative estimate: microarchitectural state increases the test space
Done with respect to an ill-defined reference
–What is correct? Often defined by the PRM + old designs + guru guidance
Expensive
–A large fraction of the design team is dedicated to verification
–Increases time-to-market, often by as much as 1–2 years
High-risk
–Typically only one chance to "get it right"
–Failures can be costly: replacement parts, bad PR, lawsuits, fatalities
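The 2^132396 figure is consistent with a particular reading of the slide's inventory – 32 registers of 32 bits each, two 8KB caches, and 300 pins of one bit each (the "two caches" and register width are our assumptions, chosen because they reproduce the exponent):

```python
# State-bit count behind the slide's 2^132396 figure.
reg_bits   = 32 * 32            # 32 registers x 32 bits = 1024
cache_bits = 2 * 8 * 1024 * 8   # two 8KB caches = 131072 bits
pin_bits   = 300                # 300 pins, one bit each
total = reg_bits + cache_bits + pin_bits
print(total)                    # 132396, i.e. 2**132396 reachable states
```

Even this "conservative" count ignores all hidden microarchitectural state (buffers, predictors, queues), which only enlarges the space.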
60
Simulation-Based Verification
Determines if the design is functionally correct at the logic level
Implemented with co-simulation of "important" test cases
–Mostly before tape-out, using RTL/logic-level simulators
Differences found at the outputs drive debug
The process continues until "sufficient" coverage of the test space
[Figure: "important" test cases drive both the µarch model and a reference model (ISA simulator); their outputs are compared – test OK?]
61
Formal Verification
Formal verification speeds testing by comparing models
–Compare reference and µarch models using formal methods (e.g., SAT)
–If the models are shown functionally equivalent, any program renders the same result
–Much better coverage than simulation-based verification
Unfortunately, an intractable task for a complete modern pipeline
–Problems: imprecise state, microarchitectural state, out-of-order operations
–The machines we build are not functionally equivalent to the reference machine!
[Figure: comparing µarch-model state against the reference model (ISA sim) – equivalence holds only if the models reach identical state]
62
Deep Submicron Reliability Challenges
More difficult to build robust systems in denser technologies
–Degraded signal quality
Increased interconnect capacitance results in signal crosstalk
Reduced supply voltage degrades noise immunity
Increased current demands create supply voltage noise
–Single-event radiation/soft errors (SER)
Alpha particles (from atomic impurities) and gamma rays (from space)
Energetic particle strikes destroy charge, may switch small transistors
Inexpensive shielding solutions unlikely to materialize
–Increased complexity
More transistors will likely mean greater complexity
Verification demands and probability of failure will increase
63
Motivating Observations
Speculative execution is fault-tolerant
–Design errors, timing errors, and electrical faults only manifest as performance divots
–A correct checking mechanism will fix the errors
What if all computation, communication, control, and progress were speculative?
–Any incorrect computation fixed: maximally speculative
–Any core fault fixed: minimally correct core
[Figure: a stuck-at fault in the branch predictor array merely forces "PC always not taken" predictions]
64
Motivating Observations (continued)
Reliable online functional verification will cover most faults
–Single-event upsets
–Design faults and incomplete implementation
–Data-dependent and noise-related electrical faults
–Untestable silicon defects and in-field circuit failures
–We utilize a simple hardware approach to detect and correct faults
Increasing the degree of speculation reduces exposure to faults
–Predictors need not be fully correct, either functionally or electrically
–Our approach leverages a maximally speculative architecture
Processors have complex implementations, yet simple semantics
–Need not validate the internal workings, only the exposed semantics
–We only check instruction semantics to keep overheads low
65
Dynamic Verification: Seatbelts for Your CPU
Core computation, communication, and control validated by the checker
–Instructions verified by the checker in program order before retirement
–Checker detects and corrects faulty results, restarts the core
Checker relaxes the burden of correctness on the core processor
–A robust checker corrects faults in any core structure not used by the checker
–Tolerates core design errors, electrical faults, silicon defects, and failures
–Core only has the burden of high-accuracy prediction
Key checker requirements: simple, fast, and reliable
[Figure: the complex core processor (IF, ID, REN, REG, scheduler, EX/MEM) passes speculative instructions in order, with PC, instruction, inputs, and addresses, to the checker processor (CHK) before commit (CT)]
66
Checker Processor Architecture
[Figure: checker pipeline (IF, ID, EX, MEM, CT) with its own RF, I-cache, and D-cache; the core's prediction stream supplies PC, instruction, register values, and result/address/next PC, and each stage compares the core's value against the checker's before signaling OK]
67
Check Mode
[Figure: the checker pipeline in check mode – each stage compares the core-supplied PC, instruction, register values, and result/address against its own recomputation]
68
Recovery Mode
[Figure: in recovery mode the checker pipeline executes on its own – fetching, decoding, and executing from its register file and caches, without the core's prediction stream]
69
How Can the Simple Checker Keep Up? Slipstream
The slipstream reduces the power requirements of the trailing car
The checker processor executes inside the core processor's slipstream
–"Fast-moving air" = branch/value predictions and cache prefetches
–The core processor's slipstream reduces the complexity requirements of the checker
–The checker rarely sees branch mispredictions, data hazards, or cache misses
72
Speeding the Checker with Core Computation
Example code:
ld f1,(X)
f4 = f1 * f2 + f3
br f4 < 0, skip
r8 = r8 + 1
skip: ...
Core processor execution: the load misses the cache, the multiply-add is a long operation, and the branch is mispredicted
Checker execution: every instruction completes without incident ("ok")
The checker executes in the wake of the core
–Leverages non-binding predictions & prefetches
Virtually no stalls remain to slow the checker
–Control hazards resolved during core execution
–Data hazards eliminated by prefetches and input value predictions
Complex microarchitectural structures only necessary in the core
73
Verifying the Checker Processor
The simple checker permits complete functional verification
–In-order blocking pipelines (trivial scheduler; no rename/reorder/commit)
–No "internal" non-architected state
Fully verified the design using Sakallah's GRASP SAT solver [DAC01]
–For the Alpha integer ISA without exceptions
–With a small register file and memory, and small data types
[Figure: checker model vs. reference model (ISA sim), with unspecified core predictions – output equivalence holds whenever the models reach identical state]
74
Beta-Release Processors
Traditional verification stalls launch until debug is complete
Checked-processor verification could overlap with launch
–Beta-release when the checker works
–Launch when performance is stable
–Step as needed, without recalls
[Figure: timelines from tape-out – traditional verification delays launch until debug completes; checked-processor verification proceeds through beta, launch, and stepping stages in parallel with debug]
75
Low-Cost SER and Noise Protection
Only need to address transients
–The checker detects and corrects noise-related faults in the core
–The core processor is designed without regard to strikes (e.g., no ECC…)
Recycle the checker's inputs on a suspected core fault
–If no error on the third execution, a transient strike hit the checker processor
–If an error on the third execution, a core processor fault occurred (e.g., SER, design error)
Protect critical checker control with triple-modular-redundant (TMR) logic
–TMR on the simple control results in only a 1.3% larger checker (synthesized design)
[Figure: core pipeline with TMR'd checker stage control (CHK IF, CHK ID/REG, CHK EX, CHK MEM) and a "3rd opinion" control path]
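The TMR idea mentioned above is just bitwise majority voting over three redundant copies of the control logic; a small illustrative model (not the actual synthesized RTL):

```python
# Triple-modular redundancy sketch: three copies of the control logic
# compute independently, and a bitwise majority vote masks any single
# faulty copy.

def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three control words."""
    return (a & b) | (b & c) | (a & c)

good = 0b1011
# All three copies agree: the vote passes the value through.
print(bin(tmr_vote(good, good, good)))   # 0b1011
# One copy corrupted by a particle strike: the vote still yields the
# correct control word, masking the fault.
print(bin(tmr_vote(good, 0b0001, good)))  # 0b1011
```

The vote is cheap precisely because the checker's control is simple, which is why the slide reports only a 1.3% area cost.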
76
Fully Testable Microprocessor Designs
The checker structure facilitates manufacturing test
–All checker inputs exposed to built-in self-test (BIST) logic
–The checker provides built-in test signature compression
The checker can be fully tested with a small BIST module
–Less than 0.5% area increase
Reduces the burden of testing on the core
–Missed core defects are corrected
–The checker acts as a core tester
[Figure: checker pipeline driven by a BIST ROM and control module that injects test vectors and checks the results – defect free?]
77
Self-Tuned Digital Systems
Modern logic design is too conservative for dynamic verification
–Unnecessary design margins consume power and performance
–The system may not be operating at the slow corner
The checker enables a self-tuned clock/voltage strategy
–Push the clock and drop the voltage until the desired power-performance characteristics are reached
–If the system fails, the reliable checker will correct the error and notify the control system
–Reclaims design margins plus any temperature and voltage margins
[Figure: temperature, voltage, and frequency ranges showing the worst-case margin at the slow corner vs. actual operating conditions; a clock/voltage generator tunes clk and Vdd for the core using the checker's error rate and temperature feedback]