A. Moshovos ©ECE Fall ‘07 ECE Toronto Out-of-Order Execution Structures
MIPS R10000-Like Design
Based on:
– Complexity-Effective Superscalar Processors
– S. Palacharla, N. Jouppi and J. E. Smith, ISCA 97
Fetch Phase
Fetch:
– Read instructions from I-Cache
– Predict branches
– Pass on to Decode phase

Decode Phase
Decode:
– Parse instruction
– Shuffle opcode parts to the appropriate ports for rename

Renaming Phase
Rename:
– Map architectural registers to physical registers
– Eliminate false dependences
– Pass renamed instructions to the scheduler (this step is called Dispatch)
Scheduling Phase
Wakeup:
– Instructions check whether they have become ready
– Input from Writeback: physical register names
Select:
– Among the ready instructions, select those to execute
– Resolves structural hazards

Register File Read Phase
Read source operands

Bypass and Execute Phase

Data Cache Access Phase

Writeback Phase
Write result to register file
Broadcast tag in order to wake up waiting instructions
– Note that the tag broadcast should happen TWO cycles before the result is produced
Reservation Station Model
Used by Pentium Pro, PowerPC 604
Re-order buffer holds values
Renaming points to re-order buffer entries
– Tomasulo-like

Physical Register File vs. Reservation Station
Physical Register File:
– Values reside in the register file
– At writeback, instructions broadcast the register name
Reservation Stations:
– Values reside:
  – In the register file upon commit (non-speculative)
  – In reservation stations prior to commit (speculative)
Quantifying Complexity
Critical path delay as a function of architectural parameters:
– Instruction window size (WinSize)
– Issue width (IW)
Full-custom implementations:
– Study the critical path
– Build a delay model
– Extrapolate how it will scale with "future" technologies
Renaming
Inputs:
– IW instructions, each with:
– Up to 2 x input register names
– Up to 1 x output register name
Outputs (per instruction):
– 2 x input physical registers
– 1 x new output physical register
– 1 x previous physical register name for checkpointing
– Updated rename table
Superscalar issue complicates things a bit

Renaming One Instruction
[Figure: RAT (labels p0 ... p31) read through read ports with s1, s2 and d. Outputs: physical names of s1 and s2, and the old mapping of d, kept for misspeculation recovery. The new register taken from the free list is written into the entry for d through the write port.]
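The single-instruction rename step can be sketched in a few lines of Python. This is a minimal illustration, not the circuit on the slide: the class name `RenameTable`, the identity initial mapping, and the `undo` helper are assumptions made for the example.

```python
from collections import deque


class RenameTable:
    """Toy register alias table (RAT): architectural -> physical register."""

    def __init__(self, num_arch, num_phys):
        # Assumption: architectural register i starts mapped to physical register i.
        self.rat = list(range(num_arch))
        # The remaining physical registers form the free list.
        self.free_list = deque(range(num_arch, num_phys))

    def rename(self, s1, s2, d):
        """Rename one instruction; returns (ps1, ps2, old_pd, new_pd)."""
        ps1 = self.rat[s1]      # read port: first source
        ps2 = self.rat[s2]      # read port: second source
        old_pd = self.rat[d]    # read port: previous mapping, kept for recovery
        new_pd = self.free_list.popleft()  # new register from the free list
        self.rat[d] = new_pd    # write port: install the new mapping
        return ps1, ps2, old_pd, new_pd

    def undo(self, d, old_pd, new_pd):
        """Misspeculation recovery for one instruction (hypothetical helper)."""
        self.rat[d] = old_pd
        self.free_list.appendleft(new_pd)


rt = RenameTable(num_arch=4, num_phys=8)
first = rt.rename(1, 2, 1)   # r1 <- r1 op r2
second = rt.rename(1, 1, 3)  # r3 <- r1 op r1, sees the new mapping of r1
```

Reading the old mapping of `d` before overwriting it is what makes the `undo` step possible.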
Renaming Two Instructions
[Figure: two instructions (s1, s2, d, new d each) access the RAT in parallel. The cross-bundle dependency check logic decides which values are output as ps1, ps2, old d and new d for each instruction.]

Renaming More Instructions
Dependency checking logic for instruction i must match against all preceding destinations
If there are multiple matches it must enforce priority:
– Pick the one closest to this instruction
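The priority rule above can be illustrated with a small Python sketch of renaming a whole group in one cycle. The function name `rename_group` and the list-based RAT and free list are assumptions for this example. Hardware performs the comparisons in parallel, whereas the loop below only models the result.

```python
def rename_group(rat, free_regs, group):
    """Rename a group of (s1, s2, d) triples given in program order.

    rat: list mapping architectural -> physical register (updated in place).
    free_regs: list of free physical registers (consumed from the front).
    Returns one (ps1, ps2, old_pd, new_pd) tuple per instruction.
    """
    # All table reads see the RAT as it was at the start of the cycle.
    looked_up = [(rat[s1], rat[s2], rat[d]) for (s1, s2, d) in group]
    new_dests = [free_regs.pop(0) for _ in group]

    renamed = []
    for i, (s1, s2, d) in enumerate(group):
        ps1, ps2, old_pd = looked_up[i]
        # Compare against all preceding destinations in the group.
        # Scanning oldest to youngest and overwriting means the closest
        # preceding match wins.
        for j in range(i):
            dj = group[j][2]
            if dj == s1:
                ps1 = new_dests[j]
            if dj == s2:
                ps2 = new_dests[j]
            if dj == d:
                old_pd = new_dests[j]
        renamed.append((ps1, ps2, old_pd, new_dests[i]))

    # Table update: the youngest writer of each architectural register wins.
    for (_, _, d), pd in zip(group, new_dests):
        rat[d] = pd
    return renamed


rat = [0, 1, 2, 3]
free_regs = [4, 5, 6]
result = rename_group(rat, free_regs, [(1, 2, 1), (1, 1, 1), (1, 3, 2)])
```

In this example the third instruction reads r1, which both earlier instructions write. It picks up the mapping created by the second instruction, the closest one.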
RAT: SRAM Implementation
[Figure: decoder, SRAM cells, bitlines and sense amps. Indexed by architectural register, outputs the physical register. #ARCH REGS rows by lg(#PHYS REGS) columns.]

SRAM RAT cell
[Figure]

RAT: CAM Implementation
[Figure: CAM cells and encoder. Searched by architectural register, outputs the physical register. #PHYS REGS rows by lg(#ARCH REGS) columns, plus an active bit per row.]
One CAM entry per physical register
Active bit indicates the current mapping
A new version is created by setting the active bit

CAM Cell
[Figure]

SRAM vs. CAM
SRAM:
– Arch reg rows
– lg(phys reg) columns
– SRAM read/write
CAM:
– Phys reg rows
– lg(arch reg) columns
– CAM match
– Update: reset previous valid bit, set current valid bit
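To contrast with the SRAM organization (a plain array indexed by architectural register), the CAM organization can be sketched as follows. The class name `CamRat` and the initial identity mapping are assumptions for illustration. The sequential search stands in for the parallel match that a real CAM performs.

```python
class CamRat:
    """Toy CAM-style RAT: one entry per physical register."""

    def __init__(self, num_arch, num_phys):
        # Each entry stores an architectural register tag and an active bit.
        self.arch_tag = [None] * num_phys
        self.active = [False] * num_phys
        # Assumption: architectural register i starts in physical register i.
        for a in range(num_arch):
            self.arch_tag[a] = a
            self.active[a] = True

    def lookup(self, arch):
        """CAM match: the active entry whose tag matches encodes its own index."""
        for phys, (tag, act) in enumerate(zip(self.arch_tag, self.active)):
            if act and tag == arch:
                return phys
        raise KeyError(arch)

    def remap(self, arch, new_phys):
        """Update: reset the previous active bit, set the new one."""
        old_phys = self.lookup(arch)
        self.active[old_phys] = False
        self.arch_tag[new_phys] = arch
        self.active[new_phys] = True
        return old_phys


cam = CamRat(num_arch=4, num_phys=8)
before = cam.lookup(2)
old = cam.remap(2, 5)
after = cam.lookup(2)
```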
Scheduler: Part #1 - Wakeup
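A minimal behavioral sketch of wakeup, assuming each window entry holds the physical register tags of its source operands and a ready flag per tag. The class name `WindowEntry` and its methods are hypothetical.

```python
class WindowEntry:
    """One issue-window entry waiting on its source operand tags."""

    def __init__(self, name, src_tags):
        self.name = name
        # One ready flag per distinct source tag (physical register name).
        self.ready = {tag: False for tag in src_tags}

    def wakeup(self, broadcast_tags):
        """Compare every source tag against the tags broadcast this cycle."""
        for tag in self.ready:
            if tag in broadcast_tags:
                self.ready[tag] = True

    def is_ready(self):
        """The entry requests selection once all source operands are ready."""
        return all(self.ready.values())


entry = WindowEntry("add", src_tags=[4, 7])
entry.wakeup({4})
ready_after_first = entry.is_ready()
entry.wakeup({7, 9})
ready_after_second = entry.is_ready()
```

Each broadcast tag is compared against every operand of every entry, which is why the tag lines and comparators grow with both issue width and window size.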
Scheduler: Part #2 - Select
For a single FU:
– Tree of arbiters
– [Figure: REQ signals and GRANT signals]
– Anyreq is raised if any req is active
– Grant is issued if the arbiter is enabled
– Root is enabled if the FU is available
– Location-based select policy
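The arbiter tree can be modeled behaviorally as an upward pass that computes anyreq at each level and a downward pass that propagates the grant. The fan-in of 4 and the lowest-index-first priority are assumptions chosen for this sketch. The function name `select_one` is hypothetical.

```python
def select_one(requests, fan_in=4, fu_available=True):
    """Return the index of the granted request, or None if nothing is granted.

    requests: list of booleans, one per window entry.
    Priority is location based: the lowest index wins.
    """
    if not requests:
        return None

    # Upward pass: each arbiter raises anyreq if any of its inputs requests.
    levels = [[bool(r) for r in requests]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([any(prev[i:i + fan_in])
                       for i in range(0, len(prev), fan_in)])

    # The root is enabled only if the functional unit is available.
    if not fu_available or not levels[-1][0]:
        return None

    # Downward pass: each enabled arbiter grants its first requesting child.
    idx = 0
    for level in reversed(levels[:-1]):
        start = idx * fan_in
        children = level[start:start + fan_in]
        idx = start + children.index(True)
    return idx


reqs = [False] * 16
reqs[6] = True
reqs[13] = True
granted = select_one(reqs)
```

The number of tree levels grows with the logarithm of the window size, which is the source of the logarithmic delay dependence noted later in the deck.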
Select for More Than One FU
Handling multiple FUs of the same type:
– Stack select logic blocks in series (a hierarchy)
– Mask the request granted to the previous unit
– NOT feasible for more than 2 FUs
Alternative:
– Statically partition the issue window among FUs (MIPS R10000, HP PA 8000)
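The stacked scheme can be sketched as a chain of select blocks, each seeing the requests with earlier grants masked out. The function names `select_first` and `select_stacked` are hypothetical, and a simple priority encoder stands in for each select block.

```python
def select_first(requests):
    """One select block: grant the lowest-index active request."""
    for i, req in enumerate(requests):
        if req:
            return i
    return None


def select_stacked(requests, num_fus):
    """Select up to num_fus requests by chaining select blocks in series."""
    remaining = list(requests)
    grants = []
    for _ in range(num_fus):
        granted = select_first(remaining)
        if granted is None:
            break
        grants.append(granted)
        # Mask the granted request before it reaches the next select block.
        remaining[granted] = False
    return grants


two_fus = select_stacked([False, True, True, False, True], num_fus=2)
one_req = select_stacked([False, True], num_fus=2)
```

Because each block waits for the previous block's grant, the delays add up in series, which is consistent with the slide's remark that the scheme does not extend beyond 2 FUs.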
Datapath and Bypass
Commonly used layout: 1 bit-slice
[Figure] Example: turn on tri-state A to pass the result of FU1 to the left operand of FU0

Complexity Analysis
Critical path delay as a function of:
– Issue width
– Window size
Structures analyzed:
– Register renaming table
– Wakeup and select
– Bypass paths

Methodology
A representative CMOS design is selected from published alternatives
The circuits are implemented for 3 technologies:
– 0.8 micron, 0.35 micron and 0.18 micron
Optimized for speed
Wire parasitics are included in the delay model:
– Rmetal, Cmetal
Methodology
Feature size scaling: $1/S$
Voltage scaling: $1/U$
Logic delay: $T_{logic} = (C_L \times V) / I$
– Capacitive load: $C_L$ scales from $1$ to $1/S$
– Supply voltage: $V$ scales from $1$ to $1/U$
– Average charge/discharge current: $I$ scales from $1$ to $1/U$
So logic delay scales as $(1/S \times 1/U) / (1/U) = 1/S$
Wire Delay
$T_{wire} = 0.5 \times R_{metal} \times C_{metal} \times L^2$
– $L$: wire length
– $R_{metal}$: resistance per unit length
– $C_{metal}$: capacitance per unit length
– $0.5$: first-order approximation of the distributed RC model (uniformly distributed R and C)
This is the intrinsic RC delay of the wire

Wire Delay Scaling
Metal thickness doesn't scale much
– Width ~ $1/S$
– $R_{metal}$ ~ $S$
Fringe capacitance (edges to parallel wires and the substrate) dominates at smaller feature sizes
– Parallel-plate capacitance scales with $1/S$
– $C_{metal}$ ~ $S$
Length scales with $1/S$
Overall scale factor: $S \times S \times (1/S)^2 = 1$
Wire delay remains constant
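The two scaling results can be checked numerically. The sketch below assumes the distributed-RC wire delay 0.5 x Rmetal x Cmetal x L^2 and the logic delay (C_L x V) / I, with all quantities normalized to 1 before scaling. The function names and the chosen values of S and U are illustrative.

```python
import math


def wire_delay(r_metal, c_metal, length):
    """Distributed-RC wire delay: 0.5 * R_metal * C_metal * L^2."""
    return 0.5 * r_metal * c_metal * length ** 2


def logic_delay(c_load, voltage, current):
    """Logic delay: (C_L * V) / I."""
    return c_load * voltage / current


S = 2.0  # feature size scaling factor (illustrative value)
U = 1.5  # voltage scaling factor (illustrative value)

# Wire: R_metal ~ S, C_metal ~ S, L ~ 1/S, so the delay is unchanged.
wire_before = wire_delay(1.0, 1.0, 1.0)
wire_after = wire_delay(1.0 * S, 1.0 * S, 1.0 / S)

# Logic: C_L ~ 1/S, V ~ 1/U, I ~ 1/U, so the delay shrinks by 1/S.
logic_before = logic_delay(1.0, 1.0, 1.0)
logic_after = logic_delay(1.0 / S, 1.0 / U, 1.0 / U)
```

Logic gets faster by a factor of S while wires stay the same, so structures dominated by wire delay take a growing share of the cycle time as feature size shrinks.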
Register Renaming Table
[Figure]

Dependency Checking Logic
Accessed in parallel with the map table
Every logical register is compared against the logical destination registers of the current rename group
For IW = 2, 4, 8, its delay is less than that of the map table
[Figure labels: r1, r4]

Renaming Delay
SRAM scheme
Delay components:
– Time to decode the arch reg index
– Time to drive the wordline
– Time to pull down the bitline
– Time for the sense amp to detect the pull-down
– MUX time is ignored, as the control from the dependency check logic arrives in advance

Renaming Circuit
[Figure]

Decoder Delay
[Figure]
Decoder Delay
Predecoding for speed
Length of predecode lines depends on:
– Cellheight: height of a single cell excluding wordlines
– Wordline spacing
– NVREG: number of virtual registers
– x3: 3-operand instructions

Decoder Delay
– Tnand: fall delay of the NAND
– Tnor: rise delay of the NOR
– Req: NAND pull-down channel resistance (Rnandpd) + predecode line metal resistance
– Ceq: diffusion capacitance of the NAND + gate capacitance of the NOR + interconnect capacitance

Decoder Delay
Substituting the predecode line length, Req and Ceq gives the decoder delay as a function of IW:
– c2: intrinsic RC delay of the predecode line
– c2 is very small
– Decoder delay is ~linearly dependent on IW

Rename Delay
Wordline:
– c2: intrinsic RC delay of the wordline
– c2 is very small
– Wordline delay is ~linearly dependent on IW

Rename Delay
Bitline:
– c2 is very small
– Bitline delay is ~linearly dependent on IW
SenseAmp delay is ~linearly dependent on IW
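The delay equations themselves did not survive on these slides. The remarks that c2 is tied to the intrinsic RC delay of the wire, that c2 is very small, and that each delay is roughly linear in IW are consistent with a quadratic of the following shape for the decoder, wordline and bitline components. This is a reconstruction under that assumption, with c0, c1 and c2 standing for technology-dependent constants that differ per component.

```latex
T_{\text{component}} = c_0 + c_1 \cdot IW + c_2 \cdot IW^2,
\qquad
c_2 \ll c_1 \;\Rightarrow\; T_{\text{component}} \approx c_0 + c_1 \cdot IW
```

The quadratic term would come from the wire delay, which grows with the square of wire length, and the wire length in turn grows with IW because each extra rename port adds wordlines and bitlines to every cell.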
Rename Logic Delay Scaling
Increase in bitline and wordline delay with increasing IW, by feature size:
– 0.8um: IW 2 to 8, bitline delay +37%
– 0.18um: IW 2 to 8, bitline delay +53%
Total delay increases linearly with IW
Each component shows a linear increase with IW
Bitline delay > wordline delay
– Bitline length ~ number of logical registers
– Wordline length ~ width of the physical register designator
IW impact on delay worsens with decreasing feature size
Wakeup Delay
Critical path: on a mismatch, pull the ready signal low
Delay components:
– Tag drivers drive the tag lines (vertical)
– Mismatched bit: the pull-down stack pulls the match line low (horizontal)
– Final OR gate ORs all the match lines of an operand tag
Ttagdrive ~ driver pull-up R, tag line length and tag line load C
Quadratic component is significant for IW > 2 at 0.18um

Wakeup Delay
Quadratic component is small for both cases
Both delays are ~linearly dependent on IW

Wakeup Delay: IW and Window Size
0.18um process
Quadratic dependence
Issue width has the greater effect: it increases all 3 delay components
As IW and WinSize increase together, the delay follows the curve marked "THIS" in the figure

Wakeup Delay: Window Size
8-way, 0.18um process
Tag drive delay increases rapidly with WinSize
Match OR delay is constant

Wakeup Delay: Feature Size
8-way, 64-entry window
Tag drive and tag match delays do not scale as well as the match OR delay
– Match OR is logic delay only
– The others also include wire delays
Selection Logic and Bypass Delay
Selection:
– Delay is logarithmically dependent on WinSize
Bypass:
– Delay is dependent on $IW^2$
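Both trends can be illustrated with a small sketch. It assumes a select tree built from 4-input arbiters and a bypass result wire whose length grows linearly with IW, so that the distributed-RC delay 0.5 x R x C x L^2 grows with IW^2. The function names, the fan-in and the unit constants are assumptions for illustration, not values from the slides.

```python
def select_tree_levels(win_size, fan_in=4):
    """Number of arbiter levels needed to cover win_size entries."""
    levels = 0
    span = 1
    while span < win_size:
        span *= fan_in
        levels += 1
    return levels


def bypass_wire_delay(iw, fu_height=1.0, r_metal=1.0, c_metal=1.0):
    """Wire delay of a result bus spanning iw functional units.

    Assumption: wire length = iw * fu_height, in arbitrary units.
    """
    length = iw * fu_height
    return 0.5 * r_metal * c_metal * length ** 2


levels_16 = select_tree_levels(16)
levels_64 = select_tree_levels(64)
ratio = bypass_wire_delay(8) / bypass_wire_delay(4)
```

Quadrupling the window from 16 to 64 entries adds only one arbiter level, while doubling the issue width quadruples the bypass wire delay in this model.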