Caltech CS184b Winter2001 -- DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day9:

Slides:



Advertisements
Similar presentations
CS 7810 Lecture 4 Overview of Steering Algorithms, based on Dynamic Code Partitioning for Clustered Architectures R. Canal, J-M. Parcerisa, A. Gonzalez.
Advertisements

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.
1 Advanced Computer Architecture Limits to ILP Lecture 3.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 14: March 19, 2014 Compute 2: Cascades, ALUs, PLAs.
Instruction-Level Parallelism (ILP)
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Oct. 14, 2002 Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)
Balancing Interconnect and Computation in a Reconfigurable Array Dr. André DeHon BRASS Project University of California at Berkeley Why you don’t really.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day17: November 20, 2000 Time Multiplexing.
CS 7810 Lecture 2 Complexity-Effective Superscalar Processors S. Palacharla, N.P. Jouppi, J.E. Smith U. Wisconsin, WRL ISCA ’97.
1 Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 15: March 12, 2007 Interconnect 3: Richness.
University of Michigan Electrical Engineering and Computer Science FLASH: Foresighted Latency-Aware Scheduling Heuristic for Processors with Customized.
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day6:
1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.
Computer Architecture 2011 – out-of-order execution (lec 7) 1 Computer Architecture Out-of-order execution By Dan Tsafrir, 11/4/2011 Presentation based.
CS294-6 Reconfigurable Computing Day 10 September 24, 1998 Interconnect Richness.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 13: February 26, 2007 Interconnect 1: Requirements.
1 EECS Components and Design Techniques for Digital Systems Lec 21 – RTL Design Optimization 11/16/2004 David Culler Electrical Engineering and Computer.
1 IBM System 360. Common architecture for a set of machines. Robert Tomasulo worked on a high-end machine, the Model 91 (1967), on which they implemented.
Computer Architecture 2010 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 14: February 28, 2007 Interconnect 2: Wiring Requirements and Implications.
Parallel Prefix Adders A Case Study
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 13: February 4, 2005 Interconnect 1: Requirements.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 15: February 12, 2003 Interconnect 5: Meshes.
ESE Spring DeHon 1 ESE534: Computer Organization Day 19: April 7, 2014 Interconnect 5: Meshes.
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day3:
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day12:
Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 2: January 21, 2015 Heterogeneous Multicontext Computing Array Work Preclass.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 8: April 26, 2001 Simultaneous Multi-Threading (SMT)
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day15:
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 7: February 6, 2012 Memories.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day11: October 30, 2000 Interconnect Requirements.
Complexity-Effective Superscalar Processors S. Palacharla, N. P. Jouppi, and J. E. Smith Presented by: Jason Zebchuk.
Microprocessor Microarchitecture Limits of Instruction-Level Parallelism Lynn Choi Dept. Of Computer and Electronics Engineering.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 9: April 15, 2005 ILP 2.
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day5:
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day16:
A. Moshovos ©ECE Fall ‘07 ECE Toronto Out-of-Order Execution Structures.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 14: February 10, 2003 Interconnect 4: Switching.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 18: April 2, 2014 Interconnect 4: Switching.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 16: February 14, 2005 Interconnect 4: Switching.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 5: February 1, 2010 Memories.
Processor Level Parallelism. Improving the Pipeline Pipelined processor – Ideal speedup = num stages – Branches / conflicts mean limited returns after.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day16: November 15, 2000 Retiming Structures.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 11: May10, 2001 Data Parallel (SIMD, SPMD, Vector)
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 13: February 6, 2003 Interconnect 3: Richness.
Oct. 18, 2000Machine Organization1 Machine Organization (CS 570) Lecture 4: Pipelining * Jeremy R. Johnson Wed. Oct. 18, 2000 *This lecture was derived.
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day8:
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day7:
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day11:
Penn ESE534 Spring DeHon 1 ESE534 Computer Organization Day 9: February 13, 2012 Interconnect Introduction.
Caltech CS184 Winter DeHon CS184a: Computer Architecture (Structure and Organization) Day 4: January 15, 2003 Memories, ALUs, Virtualization.
Csci 136 Computer Architecture II – Superscalar and Dynamic Pipelining Xiuzhen Cheng
Out-of-order execution Lihu Rappoport 11/ MAMAS – Computer Architecture Out-Of-Order Execution Dr. Lihu Rappoport.
On-chip Parallelism Alvin R. Lebeck CPS 220/ECE 252.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day14: November 10, 2000 Switching.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 12: February 5, 2003 Interconnect 2: Wiring Requirements.
15-740/ Computer Architecture Lecture 12: Issues in OoO Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 17: March 29, 2010 Interconnect 2: Wiring Requirements and Implications.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 20: February 27, 2005 Retiming 2: Structures and Balance.
CS184a: Computer Architecture (Structure and Organization)
Lynn Choi Dept. Of Computer and Electronics Engineering
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
ESE534: Computer Organization
ESE534: Computer Organization
CS184a: Computer Architecture (Structures and Organization)
CS184a: Computer Architecture (Structure and Organization)
CS184b: Computer Architecture (Abstractions and Optimizations)
Presentation transcript:

Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day9: February 1, 2000 SuperScalar Costs and Opportunities

Caltech CS184b Winter DeHon 2 Today Issue Window Registers Bypass Ultrascalar

Caltech CS184b Winter DeHon 3 Limit Studies Goal: understand how far you can go –this case, how much ILP can find Remove current/artificial limits –do full renaming –arbitrary look ahead –perfect control prediction Careful with assumptions –can still be pessemistic –is there another way to do it? –Another way around the limitation?

Caltech CS184b Winter DeHon 4 Available ILP Hennessy and Patterson 4.38

Caltech CS184b Winter DeHon 5 Window Size How many instructions forward do we look? –Only look at next = in-order issue Johnson Fig. 3.9 (32 issue window?)

Caltech CS184b Winter DeHon 6 Window Size [Hennessy and Patterson 4.39] There’s quite a bit of non-local parallelism.

Caltech CS184b Winter DeHon 7 Window Size [64-issues Hennessy and Patterson 4.47]

Caltech CS184b Winter DeHon 8 Operation Organization Consider Tree-structured calculation –freedom in ordering –consider: post-order traversal by levels from leaves –where is parallelism? –Storage cost?

Caltech CS184b Winter DeHon 9 Cost? Rsrc i  Rdst i-1 ; Rsrc i  Rdst i-2 ;… O(WS 2 ) comparisons

Caltech CS184b Winter DeHon 10 Cost? Anecdotal [Farrell, Fischer JSSC v33n5] –DEC 20-instruction queue –4 instruction issue – (80 physical registers) –10mm 2 in 0.35  m (300M 2 +) –600 MHz = 1.6ns

Caltech CS184b Winter DeHon 11 Costs? Both DEC and “Quantifying” (also DEC) –appear to use a scoreboarded scheme to avoid –accept not issue until result computed? “Quantifyng” suggests: –wakeup time  IW 2  WS 2 but assuming quadratic wire delay in length (never buffer wire) –but WS=F(IW) –certainly faster than linear time –A  IW  WS

Caltech CS184b Winter DeHon 12 Registers How many virtual registers needed? [Hennessy and Patterson 4.43]

Caltech CS184b Winter DeHon 13 Register Costs? First Order –area linear in number of registers –delay linear in number of registers Bank RF –maybe sublinear delay –at least square root number of registers wire delay sqrt of area

Caltech CS184b Winter DeHon 14 RF and IW interaction Larger Issue (Decode) –want to read/retire more registers per cycle –RF ports = 3 IW –A  ports * number –…and number of registers = F(IW) RF grows faster than linear

Caltech CS184b Winter DeHon 15 Bypass: Control Control comparison –every functional input (2 IW) –get input from every pipestage (d) from issue produce to wb for every result producer (IW) Total comparisons: d  IW 2

Caltech CS184b Winter DeHon 16 Bypass: Interconnect Linear layout –bypass span functional units and RF –physical RF grows with IW read/write ports more physical registers to support IW –FU bypass muxes grows with IW Consequently –height grows with IW (IW 2 ?) –cycle grow with IW?

Caltech CS184b Winter DeHon 17 Bypass: Interconnect “Quantifying” –quadratic wire delay –(but asymptotically, we can buffer) –largest delay component calculated (>1ns for IW=8) IW=8 about 5-6 times IW=4

Caltech CS184b Winter DeHon 18 Different Solution These assume Number of Regs > IW If IW>R, different approach… From Henry, Kuszmaul, et. Al. –ARVLSI’99 –SPAA’99 –ISCA’00

Caltech CS184b Winter DeHon 19 Consider Machine Each FU has a full RF Build network between FUs –use network to connect produce/consume –user register names to configure interconnect Signal data ready along network

Caltech CS184b Winter DeHon 20 Ultrascalar: concept model

Caltech CS184b Winter DeHon 21 Ultrascalar concept Linear delay O(1) register cost / FU Complete renaming at each FU –different set of registers –so when say complete reg at each FU, that’s only the logical registers

Caltech CS184b Winter DeHon 22 Ultrascalar: cyclic prefix

Caltech CS184b Winter DeHon 23 Parallel Prefix Basic idea is one we saw with adders An FU will either – produce a register (generate) –or transmit a register (propagate) –can do tree combining pair of FUs will either both propogate or will generate compute function by pair in one stage recurse to next stage get log-depth tree network connecting producer and consumer

Caltech CS184b Winter DeHon 24 Ultrascalar: cyclic prefix

Caltech CS184b Winter DeHon 25 Cyclic Prefix Gets delay down to log(WS) –w/ linear layout, delay still linear Issue into, retire from Window in order –serves rename shared RF issue bypass reorder

Caltech CS184b Winter DeHon 26 Ultrascalar: layout Register paths not growing. Wide, but constant width If Memory width <  n area goes as n wire goes as  n

Caltech CS184b Winter DeHon 27 Ultrascalar: asymptotics Assume M(n)<O(  n) –Area ~ n  R 2 –Delay ~ (  n)  R Claim can do –Area ~ n  R –Delay ~  (n  R) Memory grows faster, will dominate interconnect growth, hence area and delay –get extra term for memory growth (like Rent’s Rule)

Caltech CS184b Winter DeHon 28 UltraScalar: 0.25  m 128-window, 32 logical regs 64b ops ? 4-issue delays <2ns –comit, wakeup, schedule –wire delay dominate logic area ~1G 2 (? Includes all datapath)

Caltech CS184b Winter DeHon 29 Solution for: Object/binary compatibility is paramount Performance is King Recompilation not an option Cost (area, energy) is no object

Caltech CS184b Winter DeHon 30 Next Week …an alternative way to exploit ILP rely on compiler and feedback Tuesday: VLIW/Fisher Thursday: EPIC

Caltech CS184b Winter DeHon 31 (Semi?) Big Ideas Balance Size Matters Interconnect delay dominate As parameters grow –watch tradeoffs –widely different solutions prevail in different points in space (different asymptotes)