Published by Posy Newman. Modified over 9 years ago.
1
CDA 5155 Out-of-order execution: Advanced pipelines
2
Implications of Superscalar Execution
Instruction fetch?
–Taken branches, multiple branches, partial cache lines
Instruction decode?
–Simple for fixed-length ISAs, much harder for variable length
Renaming?
–Multi-ported rename table; inter-instruction dependencies must be recognized
Dynamic scheduling?
–Requires multiple result buses, smarter selection logic
Execution?
–Multiple functional units, multiple result buses
Commit?
–Multiple ROB/ARF ports; dependencies must be recognized
3
P4 Overview
More aggressive processor
–Equipped with the full set of IA-32 SIMD operations
–First flagship microarchitecture since the PPro (P6)
–Pentium 4 ISA = Pentium III ISA + SSE2
–SSE2 (Streaming SIMD Extensions 2) provides 128-bit SIMD integer and floating-point operations + prefetch
4
Execution Pipeline
5
Front End
Predicts branches
Fetches/decodes code into the trace cache
Generates µops for complex instructions
Prefetches instructions that are likely to be executed
6
Branch Prediction
Dynamically predict the direction and target of branches based on PC using the BTB
If no dynamic prediction is available, statically predict
–Taken for backwards (looping) branches
–Not taken for forward branches
–Implemented at decode
Traces built across (predicted) taken branches to avoid taken-branch penalties
Also includes a 16-entry return address stack predictor
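The decode-stage static heuristic above ("backward taken, forward not taken") can be sketched in a few lines; this is an illustrative model of the rule, not Intel's actual logic:

```python
# Static decode-stage prediction sketch: backward branches (targets at
# lower addresses) are assumed to close loops and are predicted taken;
# forward branches are predicted not taken.

def static_predict(branch_pc: int, target_pc: int) -> bool:
    """Return True if the branch is statically predicted taken."""
    return target_pc < branch_pc  # backward branch => likely a loop

# A loop-closing branch at 0x4010 jumping back to 0x4000: predicted taken.
assert static_predict(0x4010, 0x4000) is True
# A forward branch (e.g., skipping an error handler): predicted not taken.
assert static_predict(0x4010, 0x4020) is False
```

In the P4 this static rule is only a fallback; the BTB's dynamic prediction takes priority whenever it has an entry for the branch.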
7
Decoder
Single decoder available
–Operates at a maximum of 1 instruction per cycle
Receives instructions from the L2 cache 64 bits at a time
Some complex instructions must enlist the micro-ROM
–Used for very complex IA-32 instructions (> 4 µops)
–After the microcode ROM finishes, the front end resumes fetching µops from the trace cache
8
Execution Pipeline
9
Trace Cache
Primary instruction cache in the P4 architecture
–Stores 12K decoded µops
On a miss, instructions are fetched from L2
A trace predictor connects traces
The trace cache removes
–Decode latency after mispredictions
–Decode power for all pre-decoded instructions
10
Branch Hints
P4 software can provide hints to branch prediction and the trace cache
–Specify the likely direction of a branch
–Implemented with conditional branch prefixes
–Used for decode-stage predictions and trace building
11
Execution Pipeline
13
Execution
126 µops can be in flight at once
–Up to 48 loads / 24 stores
Can dispatch up to 6 µops per cycle
–2× the trace cache and retirement µop bandwidth
–Provides additional B/W for scheduling around mispeculation
14
Execution Units
15
Store and Load Scheduling
Out-of-order store and load operations
–Stores always commit in program order
48 loads and 24 stores can be in flight
Store/load buffers are allocated at the allocation stage
–24 store buffers and 48 load buffers in total
16
Execution Pipeline
17
Retirement
Can retire 3 µops per cycle
Implements precise exceptions
Reorder buffer used to organize completed µops
Also keeps track of branches and sends updated branch information to the BTB
18
Data Stream of Pentium 4 Processor
19
On-chip Caches
L1 instruction cache (trace cache)
L1 data cache
L2 unified cache
–All caches use a pseudo-LRU replacement algorithm
[Table of cache parameters not reproduced]
20
L1 Data Cache
Non-blocking
–Supports up to 4 outstanding load misses
Load latency
–2 clocks for integer
–6 clocks for floating point
1 load and 1 store per clock
Load speculation
–Assume the access will hit the cache
–"Replay" the dependent instructions when a miss is detected
21
L2 Cache
Non-blocking
Load latency
–Net load access latency of 7 cycles
Bandwidth
–1 load and 1 store in one cycle
–New cache operations may begin every 2 cycles
–256-bit wide bus between L1 and L2
–48 GB/s @ 1.5 GHz
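The 48 GB/s figure follows directly from the bus width and clock rate; a quick check:

```python
# L1<->L2 bandwidth on the slide: a 256-bit bus delivering one transfer
# per cycle at 1.5 GHz.
bus_bits = 256
bytes_per_transfer = bus_bits // 8            # 32 bytes per cycle
clock_hz = 1.5e9
bandwidth = bytes_per_transfer * clock_hz     # bytes per second
print(bandwidth / 1e9)                        # 48.0 (GB/s)
```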
22
L2 Cache Data Prefetcher
A hardware prefetcher monitors reference patterns
Brings cache lines in automatically
Attempts to fetch 256 bytes ahead of the current access
Prefetches for up to 8 simultaneous independent streams
23
Execution on MPEG4 Benchmarks @ 1 GHz
24
Problems with even faster clocks
Power goes up (power ∝ frequency × voltage²)
–Unfortunately, frequency is tied to voltage
Hard to break some operations into pipelined pieces
–Can't break all critical paths
Pipeline register writes use part of the clock
–And as frequency grows, pipeline register writes use more of the clock period
Clock skew
–Clocks to different registers will not be perfectly aligned
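Because dynamic power scales roughly as P = C·V²·f and a higher frequency usually demands a higher supply voltage, power grows much faster than clock rate. A small numeric sketch (the capacitance, voltage, and frequency values below are made up for illustration):

```python
# Why frequency scaling is power-expensive: dynamic power ~ C * V^2 * f,
# and reaching a higher f typically requires raising V as well.

def dynamic_power(c_eff, vdd, freq):
    """Dynamic switching power, P = C_eff * Vdd^2 * f."""
    return c_eff * vdd**2 * freq

base   = dynamic_power(1.0, 1.2, 2.0e9)    # baseline operating point
# A 25% higher clock, with the ~10% voltage bump assumed needed to hit it:
faster = dynamic_power(1.0, 1.32, 2.5e9)
print(faster / base)   # ~1.51: a 25% frequency gain costs ~51% more power
```

The ratio works out to 1.1² × 1.25 ≈ 1.51, which is the quadratic voltage penalty the slide alludes to.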
25
AMD Opteron
26
AMD Hammer Microarchitecture
12-stage pipeline
Pre-decoded instruction memory
–With ID bits to identify branch instructions and the first byte of each instruction
Partitioned register file
Bigger data cache
27
64-bit Architectural Extensions
28
Classical Pipelining
Synchronous digital circuit
Partition combinational logic into stages
Insert pipeline registers between stages
[Figure: combinational stages separated by pipeline registers]
29
Classical Pipelining - Problems
For max performance, all stages must be busy all the time
–How many LC2K3 instructions do something useful in each stage?
Logic must be divided equally so all computations finish at exactly the same time
–How long does it take to complete the LC2K1 decode stage?
Very deep pipelines have a lot of overhead writing to the pipeline registers
30
Wave Pipelining
Also referred to as maximal-rate pipelining
Allows multiple data waves simultaneously between successive storage elements (registers or pipeline registers)
–So intermediate pipeline registers are not needed
Uses a clock period that is less than the max propagation delay between the registers
31
Wave Pipelining (Cont.)
Data at the input is changed before the previous data has completely propagated through to the output
Picture a water slide…
[Figure: cycle-time diagram]
32
Wave Pipelining Example
Min delay of 16, max delay of 20
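The cycle-time bound implied by this example can be checked numerically. A new wave can safely enter once it can no longer catch the slowest signals of the previous wave, i.e. every Dmax − Dmin time units (register setup/overhead ignored in this sketch):

```python
# Cycle-time bound for the example: min delay 16, max delay 20.
d_min, d_max = 16, 20

conventional_cycle = d_max             # one wave in flight at a time
wave_cycle = d_max - d_min             # a new wave may enter every 4 units
waves_in_flight = d_max // wave_cycle  # up to 5 overlapping data waves

print(conventional_cycle, wave_cycle, waves_in_flight)  # 20 4 5
```

So with perfectly balanced paths this block could clock 5× faster than a conventional single-wave design, which is why equalizing path delays matters so much (next slide).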
33
Wave Pipelining – Maximizing Clock Rate
Minimum cycle time limited by the difference between min and max input-to-output delays (and device switching speed)
For max clock rate, must equalize all path delays from input to output
Factors:
–Topological path differences
–Process/temperature/power variations
–Data-dependent delay variations
Intentional clock skew?
34
Wave Pipelining - Problems
Operating speed constrained to a narrow range of frequencies for a given degree of wave pipelining
A new fabrication process requires significant redesign
No effective mechanism for starting/stopping:
–Pipeline stalls, low-speed testing?
In general, very hard to do circuit analysis
35
Benefits of Register Communication
Directly specified dependencies (contained within the instruction)
–Accurate description of communication
No false or missing dependency edges
Permits realization of the dataflow schedule
–Early description of communication
Allows scheduler pipelining without impacting speed of communication
Small communication name space
–Fast access to communication storage
Possible to map/rename the entire communication space (no tags)
Possible to bypass communication storage
36
Why Memory Scheduling is Hard (Or, Why is it called HARDware?)
Loads/stores also have dependencies through memory
–Described by effective addresses
Cannot directly leverage existing infrastructure
–Indirectly specified memory dependencies
The dataflow schedule is a function of the program's computation, preventing accurate description of communication early in the pipeline
–A pipelined scheduler is slow to react to addresses
–Large communication space (2^32–2^64 bytes!): cannot fully map the communication space; requires a more complicated cache and/or store-forward network
[Figure: *p = … ; *q = … ; … = *p — which store does the load depend on?]
37
Requirements for a Solution
Accurate description of memory dependencies
–No (or few) missing or false dependencies
–Permit realization of the dataflow schedule
Early presentation of dependencies
–Permit pipelining of the scheduler logic
Fast access to the communication space
–Preferably as fast as register communication (zero cycles)
38
In-order Load/Store Scheduling
Schedule all loads and stores in program order
–Cannot violate true data dependencies (non-speculative)
Capabilities/limitations:
–Not accurate – may add many false dependencies
–Early presentation of dependencies (no addresses)
–Not fast – all communication through memory structures
Found in in-order issue pipelines
[Figure: true vs. realized dependencies, in program order, for the sequence st X; ld Y; st Z; ld X; ld Z]
39
In-order Load/Store Scheduling Example
[Figure: cycle-by-cycle timeline for st X; ld Y; st Z; ld X; ld Z under in-order scheduling – each access waits for all earlier accesses]
40
Blind Dependence Speculation
Schedule loads and stores when register dependencies are satisfied
–May violate true data dependencies (speculative)
Capabilities/limitations:
–Accurate – if there is little in-flight communication through memory
–Early presentation of dependencies (no dependencies!)
–Not fast – all communication through memory structures
Most common with small windows
[Figure: true vs. realized dependencies for st X; ld Y; st Z; ld X; ld Z]
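One standard way to detect blind-speculation violations (assumed here as an illustration; the slides leave detection as a discussion point) is to check each resolving store address against younger loads that have already executed. A minimal sketch:

```python
# Blind speculation with store-resolution checking: loads execute as soon
# as their register inputs are ready; when a store's address resolves,
# any *younger* load to the same address that already executed is flagged
# as mispeculated (simplified model, no sizes or partial overlaps).

def find_mispeculations(events):
    """events: ('ld'|'st', program_order_seq, addr) in *execution* order.
    Returns seq numbers of loads that ran before an older store to the
    same address had resolved."""
    done_loads = []   # (seq, addr) of loads already executed
    bad = []
    for kind, seq, addr in events:
        if kind == 'ld':
            done_loads.append((seq, addr))
        else:  # store address now resolved
            bad += [s for s, a in done_loads if s > seq and a == addr]
    return bad

# Program order: st X(0), ld Y(1), st Z(2), ld X(3), ld Z(4).
# Blind speculation lets all loads execute before the stores resolve:
execution = [('ld', 1, 'Y'), ('ld', 3, 'X'), ('ld', 4, 'Z'),
             ('st', 0, 'X'), ('st', 2, 'Z')]
print(find_mispeculations(execution))   # [3, 4]: ld X and ld Z violated
```

Recovery then typically squashes and re-executes from the offending load, which is the cost the later slides weigh against blind speculation's scheduling aggressiveness.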
41
Blind Dependence Speculation Example
[Figure: timeline for st X; ld Y; st Z; ld X; ld Z with all accesses issued as soon as register inputs are ready – the loads execute before the stores they depend on, and a mispeculation is detected]
42
Discussion Points
Suggest a way to detect blind load mispeculation
Suggest a way to recover from blind load mispeculation
43
The Case for More/Less Accurate Dependence Speculation
Small windows: blind speculation is accurate for most programs; the compiler can register-allocate most short-term communication
Large windows: blind speculation performs poorly; many memory communications are in the execution window
[Graph for 099.go, from Moshovos96]
44
Conservative Dataflow Scheduling
Schedule loads and stores when all dependencies are known satisfied
–Conservative – won't violate true dependencies (non-speculative)
Capabilities/limitations:
–Accurate only if addresses arrive early
–Late presentation of dependencies (verified with addresses)
–Not fast – all communication through memory and/or a complex store-forward network
Common for larger windows
[Figure: true vs. realized dependencies for st X; ld Y; st?Z; ld X; ld Z – the address of st Z is unknown]
45
Conservative Dataflow Scheduling Example
[Figure: timeline for st X; ld Y; st?Z; ld X; ld Z – loads stall until the unknown address of st Z resolves, adding stall cycles]
46
Discussion Points
What if no dependent store or unknown store address is found?
Describe the logic used to locate dependent store instructions
What is the tradeoff between small and large memory schedulers?
How should uncached loads/stores be handled? Video RAM?
47
Memory Dependence Speculation [Moshovos96]
Schedule loads and stores when data dependencies are satisfied
–Uses a dependence predictor to match sourcing stores to loads
–Doesn't wait for addresses; may violate true dependencies (speculative)
Capabilities/limitations:
–As accurate as the predictor
–Early presentation of dependencies (data addresses not used in prediction)
–Not fast – all communication through memory structures
[Figure: true vs. realized dependencies for st?X; ld Y; st?Z; ld X; ld Z]
48
Dependence Speculation - In a Nutshell
Assumes static placement of dependence edges is persistent
–Good assumption!
Common cases:
–Accesses to global variables
–Stack accesses
–Accesses to aliased heap data
The predictor tracks store/load PCs and reproduces the last sourcing store PC given a load PC
[Figure: for A: *p = … ; B: *q = … ; C: … = *p, the predictor maps load C to its last sourcing store, A or B]
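The "last sourcing store" predictor described above can be sketched as a PC-indexed table; this is a simplified, hypothetical model (a real predictor would be a finite, tagged hardware structure):

```python
# Dependence predictor sketch: a table indexed by load PC that remembers
# the PC of the store that last forwarded data to that load. Next time
# the load is fetched, the scheduler can order it behind that store.

class DependencePredictor:
    def __init__(self):
        self.table = {}   # load PC -> predicted sourcing store PC

    def train(self, load_pc, store_pc):
        """Record which store actually sourced this load."""
        self.table[load_pc] = store_pc

    def predict(self, load_pc):
        """Predicted sourcing store PC, or None if untrained."""
        return self.table.get(load_pc)

dp = DependencePredictor()
dp.train(0xC, 0xA)        # load at PC C last got its data from store at A
print(dp.predict(0xC))    # 10 (0xA): schedule ld C behind st A next time
print(dp.predict(0x10))   # None: unseen load, no prediction
```

This captures the slide's point: the prediction is purely PC-based, so it is available early in the pipeline, long before any effective address is computed.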
49
Memory Dependence Speculation Example
[Figure: timeline for st?X; ld Y; st?Z; ld X; ld Z – the predictor holds ld X behind its predicted sourcing store while unrelated accesses proceed]
50
Memory Renaming [Tyson/Austin97]
Design maxims:
–Registers Good, Memory Bad
–Stores/Loads Contribute Nothing to Program Results
Basic idea:
–Leverage the dependence predictor to map memory communication onto the register synchronization and communication infrastructure
Benefits:
–Accurate dependence info if the predictor is accurate
–Early presentation of dependence predictions
–Fast communication through the register infrastructure
51
Memory Renaming Example
Renamed dependence edges operate at bypass speed
Load/store address stream becomes a "checker" stream
–Need only be high-B/W (if the predictor performs well)
–Risky to remove memory accesses completely
[Figure: the sequence st X; ld Y; st Z; ld X; ld Z with its dependence edges renamed onto the register infrastructure]
52
Memory Renaming Implementation
Speculative loads require a recovery mechanism
Enhancements muddy the boundaries between dependence, address, and value prediction
–Long-lived edges reside in the rename table as addresses
–Semi-constants also promoted into the rename table
[Figure: the dependence predictor, indexed by store/load PCs, produces a predicted edge name (5–9 bit tag) that indexes the edge rename table (one entry per edge), yielding a physical storage assignment – destination for stores, source for loads]
53
Experimental Evaluation
Implemented on a SimpleScalar 2.0 baseline
Dynamic scheduling timing simulation (sim-outorder)
–256-instruction RUU
–Aggressive front end
–Typical 2-level cache memory hierarchy
Aggressive memory renaming support
–4K entries in the dependence predictor
–512 edge names, LRU allocated
Load speculation support
–Squash recovery
–Selective re-execution recovery
54
Dependence Predictor Good coverage of “in-flight” communication Lots of room for improvement
55
Program Performance
Performance predicated on:
–High-B/W fetch mechanism
–Efficient mispeculation recovery mechanism
Better speedups with:
–Larger execution windows
–Increased store-forward latency
–A confidence mechanism
56
Additional Work
Turning the crank – continue to improve the base mechanisms
–Predictors (loop-carried dependencies, better stack/global prediction)
–Improve mispeculation recovery performance
Value-oriented memory hierarchy
Data value speculation
Compiler-based renaming (tagged stores and loads):
store r1,(r2):t1
store r3,(r4):t2
load r5,(r6):t1
57
Scalable Microarchitectures
Traditional microarchitecture designs are running out of gas…
Many research opportunities
–High-B/W fetch architectures (dual-path execution, multiple-branch predictors)
–Pipeline schedulers (partitioned resources, run-ahead processing)
–Fast execution cores/memory (memory renaming, region caching)
–Power-efficient microarchitectures (ASP)
[Figure: pipeline with in-order fetch (IF, ID, REN, REG), out-of-order execute (scheduler, EX/MEM), and in-order retirement (CT), annotated with the challenges: diminishing ILP, fetch starved, slow wires, slow memory, unreliable logic, increasing power]
58
Microprocessor Verification
The fatalist's approach to microprocessor verification!
Core technology: dynamic verification (DIVA: Todd Austin)
–A simple (and correct) checker processor verifies all results before retirement
–Reduces the burden of correctness on the core processor design
–Core processor relegated to branch/value prediction and cache prefetch
Fundamentally changes the design of a complex microprocessor
–Complete formal verification feasible
–Low-cost SER protection
–Beta-release microprocessors
–Self-tuned digital circuits
[Figure: fault-tolerant core plus checker – speculative instructions flow in order, with inputs and outputs, from the core pipeline (IF, ID, REN, REG, scheduler, EX/MEM) through the checker (CHK) to commit (CT)]
59
The Burden of Verification
Immense test space
–Impossible to fully test the system
–For example, 32 regs, 8K caches, 300 pins = 2^132396 states
–A conservative estimate: microarchitectural state increases the test space
Done with respect to an ill-defined reference
–What is correct? Often defined by the PRM + old designs + guru guidance
Expensive
–A large fraction of the design team is dedicated to verification
–Increases time-to-market, often by as much as 1–2 years
High-risk
–Typically only one chance to "get it right"
–Failures can be costly: replacement parts, bad PR, lawsuits, fatalities
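The 2^132396 figure is consistent with a particular reading of the slide's inventory – 32 registers of 32 bits each, two 8KB caches, and 300 pins of one bit each (the "two caches" and register width are our assumptions, chosen because they reproduce the exponent):

```python
# State-bit count behind the slide's 2^132396 figure.
reg_bits   = 32 * 32            # 32 registers x 32 bits = 1024
cache_bits = 2 * 8 * 1024 * 8   # two 8KB caches = 131072 bits
pin_bits   = 300                # 300 pins, one bit each
total = reg_bits + cache_bits + pin_bits
print(total)                    # 132396, i.e. 2**132396 reachable states
```

Even this "conservative" count ignores all hidden microarchitectural state (buffers, predictors, queues), which only enlarges the space.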
60
Simulation-Based Verification
Determines if the design is functionally correct at the logic level
Implemented with co-simulation of "important" test cases
–Mostly before tape-out, using RTL/logic-level simulators
Differences found at the outputs drive debug
The process continues until "sufficient" coverage of the test space
[Figure: "important" test cases drive both the µarch model and a reference model (ISA simulator); their outputs are compared – test OK?]
61
Formal Verification
Formal verification speeds testing by comparing models
–Compare reference and µarch models using formal methods (e.g., SAT)
–If the models are shown functionally equivalent, any program renders the same result
–Much better coverage than simulation-based verification
Unfortunately, an intractable task for a complete modern pipeline
–Problems: imprecise state, microarchitectural state, out-of-order operations
–The machines we build are not functionally equivalent to the reference machine!
[Figure: comparing µarch-model state against the reference model (ISA sim) – equivalence holds only if the models reach identical state]
62
Deep Submicron Reliability Challenges
More difficult to build robust systems in denser technologies
–Degraded signal quality
Increased interconnect capacitance results in signal crosstalk
Reduced supply voltage degrades noise immunity
Increased current demands create supply voltage noise
–Single-event radiation/soft errors (SER)
Alpha particles (from atomic impurities) and gamma rays (from space)
Energetic particle strikes destroy charge, may switch small transistors
Inexpensive shielding solutions unlikely to materialize
–Increased complexity
More transistors will likely mean greater complexity
Verification demands and probability of failure will increase
63
Motivating Observations
Speculative execution is fault-tolerant
–Design errors, timing errors, and electrical faults only manifest as performance divots
–A correct checking mechanism will fix the errors
What if all computation, communication, control, and progress were speculative?
–Any incorrect computation fixed: maximally speculative
–Any core fault fixed: minimally correct core
[Figure: a stuck-at fault in the branch predictor array merely forces "PC always not taken" predictions]
64
Motivating Observations (continued)
Reliable online functional verification will cover most faults
–Single-event upsets
–Design faults and incomplete implementation
–Data-dependent and noise-related electrical faults
–Untestable silicon defects and in-field circuit failures
–We utilize a simple hardware approach to detect and correct faults
Increasing the degree of speculation reduces exposure to faults
–Predictors need not be fully correct, either functionally or electrically
–Our approach leverages a maximally speculative architecture
Processors have complex implementations, yet simple semantics
–Need not validate the internal workings, only the exposed semantics
–We only check instruction semantics to keep overheads low
65
Dynamic Verification: Seatbelts for Your CPU
Core computation, communication, and control validated by the checker
–Instructions verified by the checker in program order before retirement
–Checker detects and corrects faulty results, restarts the core
Checker relaxes the burden of correctness on the core processor
–A robust checker corrects faults in any core structure not used by the checker
–Tolerates core design errors, electrical faults, silicon defects, and failures
–Core only has the burden of high-accuracy prediction
Key checker requirements: simple, fast, and reliable
[Figure: the complex core processor (IF, ID, REN, REG, scheduler, EX/MEM) passes speculative instructions in order, with PC, instruction, inputs, and addresses, to the checker processor (CHK) before commit (CT)]
66
Checker Processor Architecture
[Figure: checker pipeline (IF, ID, EX, MEM, CT) with its own RF, I-cache, and D-cache; the core's prediction stream supplies PC, instruction, register values, and result/address/next PC, and each stage compares the core's value against the checker's before signaling OK]
67
Check Mode
[Figure: the checker pipeline in check mode – each stage compares the core-supplied PC, instruction, register values, and result/address against its own recomputation]
68
Recovery Mode
[Figure: in recovery mode the checker pipeline executes on its own – fetching, decoding, and executing from its register file and caches, without the core's prediction stream]
69
How Can the Simple Checker Keep Up? Slipstream
The slipstream reduces the power requirements of the trailing car
The checker processor executes inside the core processor's slipstream
–"Fast-moving air" = branch/value predictions and cache prefetches
–The core processor's slipstream reduces the complexity requirements of the checker
–The checker rarely sees branch mispredictions, data hazards, or cache misses
72
Speeding the Checker with Core Computation
Example code:
ld f1,(X)
f4 = f1 * f2 + f3
br f4 < 0, skip
r8 = r8 + 1
skip: ...
Core processor execution: the load misses the cache, the multiply-add is a long operation, and the branch is mispredicted
Checker execution: every instruction completes without incident ("ok")
The checker executes in the wake of the core
–Leverages non-binding predictions & prefetches
Virtually no stalls remain to slow the checker
–Control hazards resolved during core execution
–Data hazards eliminated by prefetches and input value predictions
Complex microarchitectural structures only necessary in the core
73
Verifying the Checker Processor
The simple checker permits complete functional verification
–In-order blocking pipelines (trivial scheduler; no rename/reorder/commit)
–No "internal" non-architected state
Fully verified the design using Sakallah's GRASP SAT solver [DAC01]
–For the Alpha integer ISA without exceptions
–With a small register file and memory, and small data types
[Figure: checker model vs. reference model (ISA sim), with unspecified core predictions – output equivalence holds whenever the models reach identical state]
74
Beta-Release Processors
Traditional verification stalls launch until debug is complete
Checked-processor verification could overlap with launch
–Beta-release when the checker works
–Launch when performance is stable
–Step as needed, without recalls
[Figure: timelines from tape-out – traditional verification delays launch until debug completes; checked-processor verification proceeds through beta, launch, and stepping stages in parallel with debug]
75
Low-Cost SER and Noise Protection
Only need to address transients
–The checker detects and corrects noise-related faults in the core
–The core processor is designed without regard to strikes (e.g., no ECC…)
Recycle the checker's inputs on a suspected core fault
–If no error on the third execution, a transient strike hit the checker processor
–If an error on the third execution, a core processor fault occurred (e.g., SER, design error)
Protect critical checker control with triple-modular-redundant (TMR) logic
–TMR on the simple control results in only a 1.3% larger checker (synthesized design)
[Figure: core pipeline with TMR'd checker stage control (CHK IF, CHK ID/REG, CHK EX, CHK MEM) and a "3rd opinion" control path]
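The TMR idea mentioned above is just bitwise majority voting over three redundant copies of the control logic; a small illustrative model (not the actual synthesized RTL):

```python
# Triple-modular redundancy sketch: three copies of the control logic
# compute independently, and a bitwise majority vote masks any single
# faulty copy.

def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three control words."""
    return (a & b) | (b & c) | (a & c)

good = 0b1011
# All three copies agree: the vote passes the value through.
print(bin(tmr_vote(good, good, good)))   # 0b1011
# One copy corrupted by a particle strike: the vote still yields the
# correct control word, masking the fault.
print(bin(tmr_vote(good, 0b0001, good)))  # 0b1011
```

The vote is cheap precisely because the checker's control is simple, which is why the slide reports only a 1.3% area cost.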
76
Fully Testable Microprocessor Designs
The checker structure facilitates manufacturing test
–All checker inputs exposed to built-in self-test (BIST) logic
–The checker provides built-in test signature compression
The checker can be fully tested with a small BIST module
–Less than 0.5% area increase
Reduces the burden of testing on the core
–Missed core defects are corrected
–The checker acts as a core tester
[Figure: checker pipeline driven by a BIST ROM and control module that injects test vectors and checks the results – defect free?]
77
Self-Tuned Digital Systems
Modern logic design is too conservative for dynamic verification
–Unnecessary design margins consume power and performance
–The system may not be operating at the slow corner
The checker enables a self-tuned clock/voltage strategy
–Push the clock and drop the voltage until the desired power-performance characteristics are reached
–If the system fails, the reliable checker will correct the error and notify the control system
–Reclaims design margins plus any temperature and voltage margins
[Figure: temperature, voltage, and frequency ranges showing the worst-case margin at the slow corner vs. actual operating conditions; a clock/voltage generator tunes clk and Vdd for the core using the checker's error rate and temperature feedback]