A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures.

Presentation transcript:

A. Moshovos ©ECE Fall ‘07 ECE Toronto Out-of-Order Execution Structures

MIPS R10000-Like Design
Based on:
– Complexity-Effective Superscalar Processors
– S. Palacharla, N. Jouppi and J. E. Smith, ISCA '97

Fetch Phase
Fetch:
– Read instructions from the I-Cache
– Predict branches
– Pass on to the Decode phase

Decode Phase
Decode:
– Parse instruction
– Shuffle opcode parts to the appropriate ports for rename

Renaming Phase
Rename:
– Map architectural registers to physical registers
– Eliminate false dependences
– Pass renamed instructions to the scheduler (this step is called Dispatch)

Scheduling Phase
Wakeup:
– Instructions check whether they have become ready
– From Writeback: physical register names
Select:
– Among the ready instructions, select those to execute
– Structural hazards
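The wakeup and select steps can be sketched in software as follows. This is only an illustrative model, not the hardware described in the slides: `WindowEntry`, `wakeup`, `select` and the oldest-first policy are names and choices assumed here for the example.

```python
from dataclasses import dataclass

@dataclass
class WindowEntry:
    name: str
    src_tags: set          # physical register names this entry still waits for
    fu_type: str = "alu"   # hypothetical functional-unit class

    @property
    def ready(self):
        return not self.src_tags

def wakeup(window, broadcast_tags):
    """Clear every awaited source tag that matches a broadcast tag."""
    tags = set(broadcast_tags)
    for entry in window:
        entry.src_tags -= tags

def select(window, free_fus):
    """Issue ready entries in window order while functional units remain.

    Running out of units models the structural hazard: a ready entry
    stays in the window until a later cycle.
    """
    issued = []
    for entry in list(window):
        if entry.ready and free_fus.get(entry.fu_type, 0) > 0:
            free_fus[entry.fu_type] -= 1
            issued.append(entry)
            window.remove(entry)
    return issued

window = [
    WindowEntry("i0", {"p7"}),
    WindowEntry("i1", {"p7", "p9"}),
    WindowEntry("i2", set()),
    WindowEntry("i3", set()),
]
wakeup(window, ["p7"])                    # writeback broadcasts tag p7
issued = select(window, {"alu": 2})       # only two ALUs this cycle
issued_names = [e.name for e in issued]
remaining_names = [e.name for e in window]
```

Here `i0` wakes up on the `p7` broadcast, `i1` still waits for `p9`, and `i3` is ready but loses to the two-ALU limit.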

Register File Read Phase
Read source operands

Bypass and Execute Phase

Data Cache Access Phase

Writeback Phase
Write result to register file
Broadcast tag in order to wake up waiting instructions
– Notice that the tag broadcast should happen TWO cycles in advance of the result production

Reservation Station Model
Used by Pentium Pro, PowerPC 604
Re-order buffer holds values
Renaming points to re-order buffer entries
– Tomasulo-like

Physical Register File vs. Reservation Station
Physical Register File:
– Values reside in the register file
– At writeback, instructions broadcast the register name
Reservation Stations:
– Values reside:
  – In the register file upon commit (non-speculative)
  – In reservation stations prior to commit (speculative)

Quantifying Complexity
Critical path delay as a function of architectural parameters:
– Instruction window size (WinSize)
– Issue width (IW)
Full-custom implementations:
– Study the critical path
– Delay model
– Extrapolate how it will scale with "future" technologies

Renaming
Inputs:
– IW instructions, each with:
  – Up to 2 x input register names
  – Up to 1 x output register name
Outputs (per instruction):
– 2 x input physical registers
– 1 x new output physical register
– 1 x previous physical register name for checkpointing
– Updated rename table
Superscalar issue complicates things a bit

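Renaming One Instruction
[Figure: the RAT (entries p0 to p31) is read through its read ports with s1, s2 and d, producing the physical names of s1 and s2 and the old mapping of d (kept for misspeculation recovery); a new register from the free list is written into d's entry through the write port.]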
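A minimal software sketch of renaming a single instruction, under assumptions not stated in the slides: architectural register i initially maps to physical register i, the remaining physical registers sit on a FIFO free list, and `RenameTable`, `rename` and `undo` are hypothetical names.

```python
from collections import deque

class RenameTable:
    def __init__(self, num_arch, num_phys):
        # Assumed reset state: arch reg i -> phys reg i, the rest are free.
        self.rat = list(range(num_arch))
        self.free_list = deque(range(num_arch, num_phys))

    def rename(self, s1, s2, d):
        """Rename one instruction with sources s1, s2 and destination d."""
        ps1, ps2 = self.rat[s1], self.rat[s2]   # two read ports
        old_d = self.rat[d]                      # kept for misspeculation recovery
        new_d = self.free_list.popleft()         # IndexError if no register is free
        self.rat[d] = new_d                      # write port
        return ps1, ps2, old_d, new_d

    def undo(self, d, old_d, new_d):
        """Roll back one rename, youngest instruction first."""
        self.rat[d] = old_d
        self.free_list.appendleft(new_d)

rt = RenameTable(num_arch=32, num_phys=64)
first = rt.rename(1, 2, 3)     # r3 <- r1 op r2
second = rt.rename(3, 1, 3)    # r3 <- r3 op r1, reads the new mapping of r3
rt.undo(3, second[2], second[3])
mapping_after_undo = rt.rat[3]
```

The second instruction reads `r3` through the mapping created by the first, which is how renaming preserves the true dependence while removing the false ones.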

Renaming Two Instructions
[Figure: both instructions read the RAT with s1, s2 and d and each receives a new d; cross-bundle dependency check logic decides whether the second instruction's ps1, ps2 and old d come from the RAT or from the first instruction's new d.]

Renaming More Instructions
Dependency checking logic for instruction i must match against all preceding destinations
If there are multiple matches it must enforce priority:
– Pick the one closest to this instruction
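The priority rule can be illustrated with a small sketch. It assumes a bundle is given as `(s1, s2, d)` triples in program order and that the RAT is a plain list; `rename_group` is a hypothetical helper, and the sequential loop stands in for comparators that operate in parallel in hardware.

```python
from collections import deque

def rename_group(rat, free_list, group):
    """Rename a bundle of (s1, s2, d) triples given in program order.

    All RAT reads see the table as it was before the bundle. The
    dependency check then overrides a lookup with the new physical
    register of the closest preceding instruction that writes it.
    """
    old_rat = list(rat)                               # parallel RAT read
    new_dests = [free_list.popleft() for _ in group]
    renamed = []
    for i, (s1, s2, d) in enumerate(group):
        def lookup(reg):
            # Scan preceding instructions from nearest to farthest.
            for j in range(i - 1, -1, -1):
                if group[j][2] == reg:
                    return new_dests[j]
            return old_rat[reg]
        # The old mapping of d needs the same check as the sources.
        renamed.append((lookup(s1), lookup(s2), lookup(d), new_dests[i]))
    for (_, _, d), new_d in zip(group, new_dests):
        rat[d] = new_d                                # the last writer wins
    return renamed

rat = list(range(8))
free_list = deque([8, 9, 10])
group = [(1, 2, 3), (3, 4, 3), (3, 3, 5)]
result = rename_group(rat, free_list, group)
```

The third instruction matches both earlier writers of `r3` and takes the mapping from the second one, the closest match.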

RAT: SRAM Implementation
[Figure: a decoder indexed by the arch reg selects one row of an SRAM cell array with #ARCH REGS rows and lg(#PHYS REGS) columns; bitlines and sense amps deliver the phys reg.]

SRAM RAT cell

RAT: CAM Implementation
[Figure: the arch reg is matched against an array of CAM cells with #PHYS REGS rows and lg(#ARCH REGS) columns plus an active bit per row; an encoder turns the matching row into the phys reg.]
One CAM entry per physical register
Active bit indicates the current map
A new version is created by setting its active bit

CAM Cell

SRAM vs. CAM
SRAM:
– Arch reg rows
– Lg(phy reg) cols
– SRAM read/write
CAM:
– Phy reg rows
– Lg(arch reg) cols
– CAM match
– Update:
  – Reset previous valid bit
  – Set current valid bit
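The two organizations can be contrasted with a behavioural sketch. It models only the lookup and update semantics, not the circuits; `SramRat` and `CamRat` are hypothetical class names, and the identity mapping at reset is an assumption.

```python
class SramRat:
    """One row per architectural register; each row stores a physical register."""
    def __init__(self, num_arch):
        self.rows = list(range(num_arch))

    def lookup(self, arch):
        return self.rows[arch]              # indexed read

    def update(self, arch, phys):
        self.rows[arch] = phys              # indexed write

class CamRat:
    """One entry per physical register: an arch reg tag plus an active bit."""
    def __init__(self, num_arch, num_phys):
        self.arch = [None] * num_phys
        self.active = [False] * num_phys
        for r in range(num_arch):
            self.arch[r] = r
            self.active[r] = True

    def lookup(self, arch):
        # Associative match: compare the arch reg against every active entry.
        matches = [p for p in range(len(self.arch))
                   if self.active[p] and self.arch[p] == arch]
        assert len(matches) == 1, "exactly one active mapping per arch reg"
        return matches[0]

    def update(self, arch, phys):
        self.active[self.lookup(arch)] = False   # reset previous active bit
        self.arch[phys] = arch
        self.active[phys] = True                 # set current active bit

sram, cam = SramRat(32), CamRat(32, 64)
for table in (sram, cam):
    table.update(3, 40)
sram_r3, cam_r3 = sram.lookup(3), cam.lookup(3)
```

Both tables answer the same queries; the CAM version keeps the stale entry for `r3` in place with its active bit cleared.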

Scheduler: Part #1 - Wakeup

Scheduler: Part #2 - Select
For a single FU
Tree of arbiters:
– REQ signals, GRANT signals
– Anyreq raised if any req is active
– Grant issued if arbiter enabled
– Root enabled if FU available
Location-based select policy
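A recursive sketch of the arbiter tree for one FU. The fan-in of 4 and the rule that the lowest-numbered slot wins are assumptions made for the example, and `select_one` is a hypothetical name.

```python
def select_one(requests, fu_available, fan_in=4):
    """Return the index of the granted window slot, or None.

    anyreq flows up the tree, enable/grant flows down. Among siblings
    the lowest-numbered one has priority (location-based policy).
    """
    def arbiter(lo, hi, enable):
        # Returns (anyreq, granted_index) for window slots lo..hi-1.
        if hi - lo == 1:
            req = requests[lo]
            return req, (lo if req and enable else None)
        span = -(-(hi - lo) // fan_in)        # slots per child, ceil division
        anyreq, granted = False, None
        for start in range(lo, hi, span):
            # A child is enabled only if this arbiter is enabled and no
            # higher-priority sibling has raised anyreq.
            child_any, child_grant = arbiter(
                start, min(start + span, hi), enable and not anyreq)
            if child_grant is not None:
                granted = child_grant
            anyreq = anyreq or child_any
        return anyreq, granted

    if not requests:
        return None
    return arbiter(0, len(requests), fu_available)[1]   # root enabled if FU free

requests = [False] * 16
requests[5] = requests[11] = True
granted_free = select_one(requests, fu_available=True)
granted_busy = select_one(requests, fu_available=False)
```

With a 16-entry window and fan-in 4 the tree is two levels deep, which reflects the logarithmic dependence of selection delay on window size.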

Select for More than One FU
Handling multiple FUs of the same type:
– Stack select logic blocks in series (a hierarchy)
– Mask the request granted to the previous unit
NOT feasible for more than 2 FUs
Alternative:
– Statically partition the issue window among FUs (MIPS R10000, HP PA 8000)
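Both options can be sketched behaviourally. The sketch assumes a lowest-slot-first priority and contiguous, equal-sized partitions; `priority_select`, `stacked_select` and `partitioned_select` are hypothetical names.

```python
def priority_select(requests):
    """Location-based select: grant the lowest-numbered requesting slot."""
    for slot, req in enumerate(requests):
        if req:
            return slot
    return None

def stacked_select(requests, num_fus):
    """Select blocks in series: each block sees the requests left by the previous one."""
    pending = list(requests)
    grants = []
    for _ in range(num_fus):
        slot = priority_select(pending)
        if slot is None:
            break
        grants.append(slot)
        pending[slot] = False        # mask the request granted to the previous unit
    return grants

def partitioned_select(requests, num_fus):
    """Static partition: FU k only arbitrates over its own slice of the window."""
    size = -(-len(requests) // num_fus)      # ceil division
    grants = []
    for fu in range(num_fus):
        lo = fu * size
        local = priority_select(requests[lo:lo + size])
        grants.append(None if local is None else lo + local)
    return grants

spread = [False, True, True, False, False, False, True, False]
clustered = [False, True, True, False, False, False, False, False]
stacked_spread = stacked_select(spread, 2)
partitioned_spread = partitioned_select(spread, 2)
stacked_clustered = stacked_select(clustered, 2)
partitioned_clustered = partitioned_select(clustered, 2)
```

The series version puts one select block after another on the critical path, while the partitioned version runs its blocks in parallel but can leave an FU idle when ready instructions cluster in one partition.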

Datapath and Bypass
Commonly used layout: 1 bit-slice
Turn on tri-state A to pass the result of FU1 to the left operand of FU0

Complexity Analysis
Critical path delay as a function of:
– Issue width
– Window size
Structures analyzed:
– Register renaming table
– Wakeup and select
– Bypass paths

Methodology
A representative CMOS design is selected from published alternatives
The circuits are implemented for 3 technologies:
– 0.8 micron, 0.35 micron and 0.18 micron
Optimized for speed
Wire parasitics included in the delay model:
– Rmetal, Cmetal

Methodology
Feature size scaling: 1 / S
Voltage scaling: 1 / U
Logic Delay = (C_L x V) / I
– Capacitive load: C_L = 1 → 1 / S
– Supply voltage: V = 1 → 1 / U
– Average charge/discharge current: I = 1 → 1 / U
So, Logic Delay = (1 / S x 1 / U) / (1 / U) = 1 / S

Wire Delay
Intrinsic RC delay = 0.5 x Rmetal x Cmetal x L^2
– L: wire length
– Rmetal: resistance per unit length
– Cmetal: capacitance per unit length
– 0.5: first-order approximation of the distributed RC model (uniformly distributed R and C)

Wire Delay Scaling
Metal thickness doesn't scale much
– Width ~ 1/S
– Rmetal ~ S
Fringe capacitance dominates in smaller feature sizes (edges to parallel wires and the substrate)
Parallel plate capacitance scales with 1 / S
– Cmetal ~ S
Length scales with 1/S
Overall scale factor: S x S x (1/S)^2 = 1
Wire delay remains constant
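The two scaling arguments can be checked numerically. The sketch below uses exact fractions and the usual first-order distributed RC estimate, 0.5 x Rmetal x Cmetal x L^2, for the wire; the function names and the sample values S = 2 and U = 3/2 are illustrative assumptions.

```python
from fractions import Fraction

def logic_delay_scale(S, U):
    """Scale factor of (C_L x V) / I when C_L -> 1/S, V -> 1/U, I -> 1/U."""
    c_load = Fraction(1) / S
    voltage = Fraction(1) / U
    current = Fraction(1) / U
    return c_load * voltage / current

def wire_delay(r_metal, c_metal, length):
    """First-order distributed RC delay of a wire."""
    return Fraction(1, 2) * r_metal * c_metal * length ** 2

def wire_delay_scale(S):
    """Scale factor when Rmetal ~ S, Cmetal ~ S and length ~ 1/S."""
    before = wire_delay(Fraction(1), Fraction(1), Fraction(1))
    after = wire_delay(Fraction(S), Fraction(S), Fraction(1) / S)
    return after / before

S, U = Fraction(2), Fraction(3, 2)       # illustrative scaling factors
logic_scale = logic_delay_scale(S, U)
wire_scale = wire_delay_scale(S)
```

Logic gets faster by 1/S while the wire delay factor stays at 1, so wire-dominated structures take up a growing share of the cycle as feature size shrinks.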

Register Renaming Table

Dependency Checking Logic
Accessed in parallel with the map table
Every logical register is compared against the logical destination registers of the current rename group
For IW = 2, 4, 8, its delay is less than that of the map table

Renaming Delay
SRAM scheme
Delay components:
– Time to decode the arch reg index
– Time to drive the wordline
– Time to pull down the bitline
– Time for the sense amp to detect the pull-down
– MUX time is ignored because the control signal from the dependency check logic arrives in advance

Renaming Circuit

Decoder Delay

Decoder Delay
Predecoding for speed
Length of predecode lines:
– Cellheight: height of a single cell excluding wordlines
– Wordline spacing
– NVREG: number of virtual registers
– x3: 3-operand instructions

Decoder Delay
– Tnand: fall delay of NAND
– Tnor: rise delay of NOR
– Rnandpd: NAND pull-down channel resistance + predecode line metal resistance
– Ceq: diffusion capacitance of NAND + gate capacitance of NOR + interconnect capacitance

Decoder Delay
Substituting the predecode line length, Req and Ceq, we get:
– c2: intrinsic RC delay of the predecode line
– c2 is very small
– Decoder delay ~linearly dependent on IW

Rename Delay
Wordline:
– c2: intrinsic RC delay of the wordline
– c2 is very small → wordline delay ~linearly dependent on IW

Rename Delay
Bitline:
– c2 is very small
– Bitline delay ~linearly dependent on IW
SenseAmp delay ~linearly dependent on IW
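One way to read the "c2 very small" remarks on the decoder, wordline and bitline slides, assuming each component delay is a quadratic in IW with c2 as the coefficient of the quadratic term (the symbols c_0 and c_1 are labels introduced here, not taken from the slides):

```latex
T(IW) = c_0 + c_1 \, IW + c_2 \, IW^2,
\qquad
c_2 \, IW^2 \ll c_1 \, IW
\;\Longrightarrow\;
T(IW) \approx c_0 + c_1 \, IW
```

Under this assumption the quadratic term comes from the intrinsic RC delay of a wire whose length grows with IW, which is why it stays negligible only while that wire delay is small compared with the logic delay.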

Rename Logic Delay Scaling
As feature size decreases, the increase in bitline and wordline delay with increasing IW gets larger:
– 0.8um: IW 2 → 8 gives bitline delay +37%
– 0.18um: IW 2 → 8 gives bitline delay +53%
Total delay increases linearly with IW
Each component shows a linear increase with IW
Bitline delay > wordline delay
– Bitline length ~ number of logical registers
– Wordline length ~ width of physical register designator
IW impact on delay worsens with decreasing feature size

Wakeup Delay
Critical path: mismatch → pull ready signal low
Delay components:
– Tag drivers → drive tag lines (vertical)
– Mismatched bit: pull-down stack → pull matchline low (horizontal)
– Final OR gate → OR all the matchlines of an operand tag
Ttagdrive ~ driver pull-up R, tagline length and tagline load C
Quadratic component significant for IW > 2 and 0.18um

Wakeup Delay
Quadratic component small for both cases
Both delays ~linearly dependent on IW

Wakeup Delay: IW and Window Size
0.18um process
Quadratic dependence
Issue width has the greater effect → it increases all 3 delay components
As IW and WinSize increase together, the delay actually changes as shown in the slide's plot [figure not preserved]

Wakeup Delay: Window Size
8-way and 0.18um process
Tag drive delay increases rapidly as WinSize increases
Match OR delay stays constant

Wakeup Delay: Feature Size
8-way and 64-entry window
Tag drive and tag match delays do not scale as well as Match OR delay
– Match OR → logic delay
– Others → also have wire delays

Selection Logic and Bypass Delay
Selection:
– Logarithmically dependent on WinSize
Bypass:
– Delay dependent on (IW)^2