Caltech CS184 Spring2005 -- DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 9: April 15, 2005 ILP 2.

Presentation transcript:

Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 9: April 15, 2005 ILP 2

Caltech CS184 Spring DeHon 2 Today ILP Limits Practical Issues –Finite size issues Cost Scaling Ultrascalar

Caltech CS184 Spring DeHon 3 Limit Studies Goal: understand how far you can go –in this case, how much ILP we can find Remove current/artificial limits –do full renaming, arbitrary lookahead –perfect control prediction, memory disambiguation Careful with assumptions –can still be pessimistic –is there another way to do it? –another way around the limitation?

Caltech CS184 Spring DeHon 4 Available ILP [Hennessy and Patterson 4.38e2/3.35e3]

Caltech CS184 Spring DeHon 5 What do we achieve today? Pentium … < 1 instruction/cycle retired –But low cycle time –Time = CPI × Instructions × CycleTime Have not seen attempts to issue more than 4 instructions/cycle –Much less sustain retiring more than 4
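The performance equation on this slide can be tried out with a small sketch. The function name and all the numbers below are hypothetical, chosen only to show that a machine retiring fewer instructions per cycle can still finish sooner if its cycle time is short enough.

```python
def execution_time_ns(cpi, instructions, cycle_time_ns):
    """Run time = CPI x instruction count x cycle time (in nanoseconds)."""
    return cpi * instructions * cycle_time_ns


# Hypothetical design points, not measurements of any real processor.
wide_slow_clock = execution_time_ns(cpi=0.5, instructions=1_000_000, cycle_time_ns=2.0)
narrow_fast_clock = execution_time_ns(cpi=1.25, instructions=1_000_000, cycle_time_ns=0.5)

print(wide_slow_clock)    # 1000000.0
print(narrow_fast_clock)  # 625000.0
```

In this made-up comparison the design with CPI above 1 wins because its cycle is four times shorter, which is the trade-off the slide points at.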

Caltech CS184 Spring DeHon 6 Limit Effects

Caltech CS184 Spring DeHon 7 Superscalar [Figure: pipeline block diagram with IF, Decode, Queue, EX (ALU, MPY, LD/ST), RF, and RUU; labeled parameters: Fetch Width, Window Size, Physical Registers, Issue Width, Number/Types of Functional Units]

Caltech CS184 Spring DeHon 8 Window Size (unlimited issue) [Hennessy and Patterson 4.39e2/3.36e3] There’s quite a bit of non-local parallelism.

Caltech CS184 Spring DeHon 9 Window Size (64-Issue limited) [64-issues Hennessy and Patterson 4.47e2/3.45e3]

Caltech CS184 Spring DeHon 10 Operation Organization Consider Tree-structured calculation –freedom in ordering –consider: post-order traversal by levels from leaves –where is parallelism? –Storage cost?

Caltech CS184 Spring DeHon 11 Window Size How many instructions forward do we look? –Only look at next = in-order issue Johnson Fig. 3.9 (32 issue window?)

Caltech CS184 Spring DeHon 12 Branch Prediction [Hennessy & Patterson Fig 3.38/e3]

Caltech CS184 Spring DeHon 13 Window Cost? Check no one before you in the window writes a value you need: Rsrc_i ≠ Rdst_(i-1); Rsrc_i ≠ Rdst_(i-2); … O(WS^2) comparisons
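A minimal sketch of the window check, assuming each instruction is a hypothetical `(dst, src1, src2)` tuple of register names listed oldest first. It compares every source against the destination of every earlier instruction and counts the comparisons; a real issue window would also have to pick the most recent writer, which this sketch does not model.

```python
def window_dependency_checks(window):
    """Return (comparisons, deps) for a list of (dst, src1, src2) tuples.

    deps holds (consumer_index, producer_index) pairs where the consumer
    reads a register that an earlier instruction in the window writes.
    """
    comparisons = 0
    deps = []
    for i, (_, *srcs) in enumerate(window):
        for j in range(i):                 # every instruction ahead of i
            dst_j = window[j][0]
            comparisons += len(srcs)       # one comparator per source operand
            if dst_j in srcs:
                deps.append((i, j))
    return comparisons, deps


window = [
    ("r1", "r2", "r3"),
    ("r4", "r1", "r5"),   # reads r1 written by instruction 0
    ("r6", "r7", "r8"),
    ("r9", "r4", "r6"),   # reads r4 and r6 written by instructions 1 and 2
]
count, deps = window_dependency_checks(window)
print(count, deps)  # 12 [(1, 0), (3, 1), (3, 2)]
```

With two sources per instruction the count is WS × (WS − 1), so doubling the window from 4 to 8 entries raises it from 12 to 56 comparisons.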

Caltech CS184 Spring DeHon 14 Cost? Anecdotal [Farrell, Fischer JSSC v33n5] –DEC 20-instruction queue –4 instruction issue –(80 physical registers) –10 mm^2 in 0.35 μm (300Mλ^2+) Compare: –300 4-LUTs (w/ interconnect) –MIPS-X 32b CPU w/ 1KB memory = 68Mλ^2 –600 MHz = 1.6ns
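The area figure can be sanity-checked by normalizing to λ². The sketch below assumes the usual convention that λ is half the drawn feature size; that convention and the function name are assumptions, not something the slide states.

```python
def area_in_lambda_squared(area_mm2, feature_um):
    """Convert an area in mm^2 to units of lambda^2.

    Assumes lambda = feature size / 2 (a common convention, assumed here).
    """
    lambda_um = feature_um / 2.0
    area_um2 = area_mm2 * 1e6          # 1 mm^2 = 1e6 um^2
    return area_um2 / (lambda_um ** 2)


queue_area = area_in_lambda_squared(area_mm2=10.0, feature_um=0.35)
print(f"{queue_area / 1e6:.0f} M lambda^2")  # roughly 327 M lambda^2
```

Under that assumption the 10 mm² issue queue comes out a little above 300Mλ², in line with the slide, and several times the 68Mλ² quoted for the whole MIPS-X CPU.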

Caltech CS184 Spring DeHon 15 Costs? Both DEC and “Quantifying” (also DEC) –appear to use a scoreboarded scheme to avoid –accept not issue until result computed? “Quantifying” suggests: –wakeup time ∝ IW^2 × WS^2 but assuming quadratic wire delay in length (never buffer wire) –but WS=F(IW) –Certainly grows faster than linear time –A ∝ IW × WS

Caltech CS184 Spring DeHon 16 Registers How many virtual registers needed? [Hennessy and Patterson 4.43e2/3.41e3]

Caltech CS184 Spring DeHon 17 Register Costs? First Order –area linear in number of registers –delay linear in number of registers Banked RF –maybe sublinear delay –at least square root of the number of registers (wire delay goes as sqrt of area)

Caltech CS184 Spring DeHon 18 RF and IW interaction Larger Issue (Decode) –want to read/retire more registers per cycle –RF ports = 3 × IW [Op Rdst ← Rsrc1, Rsrc2] –A ∝ ports × number of registers –…and number of registers = F(IW) –A ∝ IW × F(IW) RF grows faster than linear

Caltech CS184 Spring DeHon 19 Bypass: Control Control comparison –every functional unit input (2 × IW) –may get its input from every pipestage (d) between issue/produce and writeback –for every result producer (>IW) Total comparisons: d × IW^2
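The bypass count can be sketched as a product of consumers and producers. The model below assumes exactly IW results in flight per pipestage (the slide says at least IW) and two source operands per operation; the function and parameter names are hypothetical.

```python
def bypass_comparisons(issue_width, depth, srcs_per_op=2):
    """Count tag comparisons needed to steer bypassed results.

    Assumes issue_width results in flight in each of `depth` pipestages.
    """
    consumer_inputs = srcs_per_op * issue_width
    producers_in_flight = depth * issue_width
    return consumer_inputs * producers_in_flight


print(bypass_comparisons(issue_width=4, depth=3))  # 96
print(bypass_comparisons(issue_width=8, depth=3))  # 384
```

The constant factor of 2 is dropped in the slide's d × IW² figure; the point is the quadratic growth, so doubling issue width quadruples the comparators at fixed depth.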

Caltech CS184 Spring DeHon 20 Bypass: Interconnect Linear layout –bypass spans functional units and RF –physical RF grows with IW: more read/write ports, more physical registers to support IW –FU bypass muxes grow with IW Consequently –width grows with IW –cycle grows with IW?

Caltech CS184 Spring DeHon 21 Bypass: Interconnect “Quantifying” –quadratic wire delay –(but asymptotically, we can buffer) –largest delay component calculated (>1ns for IW=8) [180nm] IW=8 about 5-6 times IW=4

Caltech CS184 Spring DeHon 22 Aliasing Do memory operations depend on one another? E.g. A[j+3]=x*x+y; Z=A[i-2]+A[i+2]; Is A[i-2] or A[i+2] another name for A[j+3]? E.g. *a++; *b+=3; *a++; d=*c+3; Are these operations all independent? Or do some name the same memory location?
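The array example can be made concrete with a small sketch showing that whether the store and the loads alias depends on the run-time values of i and j. The helper name is hypothetical; it only covers the first example on the slide, not the pointer case.

```python
def aliasing_loads(i, j):
    """Return which loads in Z = A[i-2] + A[i+2] touch the element
    written by the store A[j+3] = ..., for given index values."""
    store_index = j + 3
    load_indices = {"A[i-2]": i - 2, "A[i+2]": i + 2}
    return [name for name, idx in load_indices.items() if idx == store_index]


print(aliasing_loads(i=5, j=0))  # ['A[i-2]']  (both name A[3])
print(aliasing_loads(i=1, j=0))  # ['A[i+2]']  (both name A[3])
print(aliasing_loads(i=0, j=0))  # []          (independent)
```

Because the answer changes with i and j, hardware or a compiler that cannot prove the indices differ has to order the load after the store or speculate and check.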

Caltech CS184 Spring DeHon 23 Aliasing [Hennessy & Patterson Fig 3.43/e3]

Caltech CS184 Spring DeHon 24 …And now for something Completely Different

Caltech CS184 Spring DeHon 25 Different Solution These assume Number of Regs > IW If IW>R, different approach… From Henry, Kuszmaul, et al. –ARVLSI’99 –SPAA’99 –ISCA’00

Caltech CS184 Spring DeHon 26 Consider Machine Each FU has a full RF Build network between FUs –use network to connect producers and consumers –use register names to configure interconnect Signal data ready along network

Caltech CS184 Spring DeHon 27 Ultrascalar: concept model

Caltech CS184 Spring DeHon 28 Ultrascalar concept Linear delay O(1) register cost / FU Complete renaming at each FU –different set of registers –so when say complete RF at each FU, that’s only the logical registers

Caltech CS184 Spring DeHon 29 Ultrascalar: cyclic prefix

Caltech CS184 Spring DeHon 30 Parallel Prefix Basic idea is one we saw with adders An FU will either –produce a register (generate) –or transmit a register (propagate) Can do tree combining –a pair of FUs will either both propagate or will generate –compute function by pair in one stage –recurse to next stage –get log-depth tree network connecting producer and consumer
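A minimal software sketch of the generate/propagate idea for a single logical register. Each station either writes the register (a value) or passes it through (`None`). The scan below uses a Hillis-Steele style doubling scheme rather than the tree network the slide describes, and it is not cyclic; both are simplifications assumed here, and all names are hypothetical.

```python
def combine(older, younger):
    """If the younger span generates a value it wins; otherwise propagate."""
    return younger if younger is not None else older


def prefix_scan(writes):
    """Inclusive scan over stations in log2(n) combining stages."""
    vals = list(writes)
    step, stages = 1, 0
    while step < len(vals):
        vals = [combine(vals[k - step], vals[k]) if k >= step else vals[k]
                for k in range(len(vals))]
        step *= 2
        stages += 1
    return vals, stages


def visible_values(initial, writes):
    """Value of the register as seen on entry to each station."""
    scanned, stages = prefix_scan(writes)
    seen = [initial] + [combine(initial, v) for v in scanned[:-1]]
    return seen, stages


# Stations 1 and 4 write the register; the others only propagate it.
seen, stages = visible_values(initial=1, writes=[None, 7, None, None, 9, None, None, None])
print(seen)    # [1, 1, 7, 7, 7, 9, 9, 9]
print(stages)  # 3
```

Since `combine` is associative, the pairwise results can be grouped in any order, which is what lets 8 stations resolve in 3 combining stages instead of a 7-step linear chain.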

Caltech CS184 Spring DeHon 31 Ultrascalar: cyclic prefix

Caltech CS184 Spring DeHon 32 Cyclic Prefix Gets delay down to log(WS) –w/ linear layout, delay still linear Issue into, retire from Window in order –serves as rename, shared RF, issue, bypass, reorder

Caltech CS184 Spring DeHon 33 Ultrascalar: layout Register paths not growing. (p=0 tree!) Wide, but constant width If memory width < √n –area goes as n –wire goes as √n

Caltech CS184 Spring DeHon 34 Ultrascalar: asymptotics Assume M(n) < O(√n) –Area ~ n × R^2 –Delay ~ √n × R Claim can do –Area ~ n × R –Delay ~ √(n × R) If memory grows faster, will dominate interconnect growth, hence area and delay –get extra term for memory growth (like Rent’s Rule)
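The two sets of scaling relations can be compared numerically. This is only a sketch with all constant factors dropped and arbitrary units, so the numbers mean nothing in absolute terms; n is taken as the window size and R as the number of logical registers, and the function names are hypothetical.

```python
import math


def ultrascalar_basic(n, r):
    """Area ~ n * R^2, delay ~ sqrt(n) * R (constants dropped)."""
    return n * r ** 2, math.sqrt(n) * r


def ultrascalar_claimed(n, r):
    """Area ~ n * R, delay ~ sqrt(n * R) (constants dropped)."""
    return n * r, math.sqrt(n * r)


basic_area, basic_delay = ultrascalar_basic(n=128, r=32)
claimed_area, claimed_delay = ultrascalar_claimed(n=128, r=32)
print(basic_area, claimed_area)      # 131072 4096
print(round(basic_delay), claimed_delay)  # 362 64.0
```

In both cases the delay term is the square root of the area term, which matches the slide's picture of wire length across a square layout setting the delay.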

Caltech CS184 Spring DeHon 35 UltraScalar: 0.25 μm 128-window, 32 logical regs 64b ops ? 8 instruction fetch delays <2ns [0.25 μm] –commit, wakeup, schedule –wire delay dominates logic area ~2Gλ^2 (not including datapath)

Caltech CS184 Spring DeHon 36 Solution for: Object/binary compatibility is paramount Performance is King Recompilation not an option Cost (area, energy) is no object

Caltech CS184 Spring DeHon 37 (Semi?) Big Ideas Good to look at –Extremes (what can this possibly do?) –Sensitivity (how important is this to…) Balance Size Matters Interconnect delay dominates As parameters grow –watch tradeoffs –widely different solutions prevail at different points in the space (different asymptotes)