

EECS 470
Control Hazards and ILP
Lecture 3 – Fall 2013
Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, University of Pennsylvania, and University of Wisconsin.

Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch

Announcements
HW1 due today
Project 1 due Sunday by 6pm.
– Note: can submit multiple times, only last one counts.
HW2 posted over the weekend.

Getting help
Mark Brehob office hours (4632 Beyster unless noted):
– Monday 9:30-11:00am
– Tuesday 3:30-5:00pm in EECS 2334, which is the 373 lab; you'll need to grab me and let me know you have an EECS 470 question.
– Wednesday 5:00-5:45pm
– Thursday 4:00-5:30pm
GSIs/IA (DC3NE):
– Monday: 5:30-7:30 (William), 7:00-9:00 (Jason)
– Tuesday: 5:00-7:00 (Jon)
– Wednesday: 4:30-6:30 (Jon)
– Thursday: 1:00-3:00 (William)
– Friday: 5:30-6:30 (in lab, Jon and William)
– Saturday: 6:00-9:00 (Jon)
– Sunday: 6:00-9:00 (William)
Don’t forget Piazza. If you don’t have access, drop William an email ASAP.

Readings
– H & P Chapter C, 3.1,
  Reading all of 3 at some point, so may want to read it all now…
– Book note: Book is on-line so you don’t need to buy it.
  But exams are open book and open notes…
– Won’t need the text, but…

Today
Control hazards (again)
Costs and Power
Instruction Level Parallelism (ILP) and Dynamic Execution

Handling Control Hazards
Avoidance (static)
– No branches?
– Convert branches to predication
  Control dependence becomes data dependence
Detect and Stall (dynamic)
– Stop fetch until branch resolves
Speculate and squash (dynamic)
– Keep going past branch, throw away instructions if wrong

Avoidance Via Predication
Source:
if (a == b) {
  x++;
  y = n / d;
}
With a branch:
sub t1 ← a, b
jnz t1, PC+2
add x ← x, #1
div y ← n, d
With predicated instructions:
sub t1 ← a, b
add(t1) x ← x, #1
div(t1) y ← n, d
With conditional moves:
sub t1 ← a, b
add t2 ← x, #1
div t3 ← n, d
cmov(t1) x ← t2
cmov(t1) y ← t3
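The conditional-move form can be checked against the branching form with a small Python sketch. It assumes the predicate fires when t1 is zero (that is, a == b), which the slide leaves implicit. The helper `cmov` and both function names are hypothetical, chosen only for illustration. Integer division stands in for `div`.

```python
def branch_version(a, b, x, y, n, d):
    """Original control flow: the body runs only when a == b."""
    if a == b:
        x += 1
        y = n // d
    return x, y


def cmov(pred, old, new):
    """Hypothetical conditional move: take `new` if pred holds, else keep `old`."""
    return new if pred else old


def cmov_version(a, b, x, y, n, d):
    """Branch-free form: compute both results, then select."""
    t1 = a - b
    t2 = x + 1
    t3 = n // d          # executes unconditionally, even when a != b
    take = (t1 == 0)     # assumption: predicate is true when t1 is zero
    x = cmov(take, x, t2)
    y = cmov(take, y, t3)
    return x, y


for a, b in [(1, 1), (1, 2)]:
    args = (a, b, 5, 0, 20, 4)
    print(args, branch_version(*args), cmov_version(*args))
```

One design point the sketch exposes: because `t3 = n // d` always executes, the conditional-move form would raise on d == 0 even when the branch form would have skipped the division. Hardware that if-converts this way needs the speculatively executed operation not to fault.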

Handling Control Hazards: Detect & Stall
Detection
– In decode, check if opcode is branch or jump
Stall
– Hold next instruction in Fetch
– Pass noop to Decode

Handling Control Hazards: Speculate & Squash
Speculate “Not-Taken”
– Assume branch is not taken
Squash
– Overwrite opcodes in Fetch, Decode, Execute with noop
– Pass target to Fetch

Problems with Speculate & Squash
Always assumes branch is not taken
Can we do better? Yes.
– Predict branch direction and target!
– Why possible? Program behavior repeats.
More on branch prediction to come...
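To put rough numbers on the difference between stalling on every branch and speculating not-taken, here is a simple CPI model in Python. The branch frequency, taken rate, and 3-cycle penalty are illustrative assumptions, not figures from the lecture. The model also assumes a base CPI of 1 with no other stalls, and that both schemes lose the same number of cycles when they lose any.

```python
def cpi_detect_and_stall(branch_freq, stall_cycles, base_cpi=1.0):
    """Every branch holds fetch until it resolves, so every branch pays."""
    return base_cpi + branch_freq * stall_cycles


def cpi_speculate_not_taken(branch_freq, taken_rate, squash_cycles, base_cpi=1.0):
    """Only taken branches were guessed wrong, so only they pay the squash."""
    return base_cpi + branch_freq * taken_rate * squash_cycles


# Illustrative workload parameters (assumptions, not measured values).
BRANCH_FREQ = 0.20   # fraction of instructions that are branches
TAKEN_RATE = 0.60    # fraction of branches that are taken
PENALTY = 3          # cycles lost per stall or squash

stall_cpi = cpi_detect_and_stall(BRANCH_FREQ, PENALTY)
spec_cpi = cpi_speculate_not_taken(BRANCH_FREQ, TAKEN_RATE, PENALTY)
print(f"detect & stall CPI:      {stall_cpi:.2f}")
print(f"speculate not-taken CPI: {spec_cpi:.2f}")
```

Under this model, speculation helps only on branches that really are not taken. This is why predicting direction and target, rather than always guessing not-taken, is the natural next step.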

The cost of computing… $ and Watts

Digital System Cost
Cost is also a key design constraint
– Architecture is about trade-offs
– Cost plays a major role
Huge difference between Cost & Price
E.g.,
– Higher Price → Lower Volume → Higher Cost → Higher Price
– Direct Cost
– List vs. Selling Price
Price also depends on the customer
– College student vs. US Government
[Figure: relative cost ($) across Supercomputer, Servers, Desktops, Portables, Embedded]

Direct Cost
Cost distribution for a Personal Computer
– Processor board 37%
  CPU, memory, …
– I/O devices 37%
  Hard disk, DVD, monitor, …
– Software 20%
– Tower/cabinet 6%
Integrated systems account for a substantial fraction of cost

IC Cost Equation
IC cost = (Die cost + Test cost + Packaging cost) / Final test yield
Die cost = Wafer cost / (Dies/wafer x Die yield)
Die yield = f(defect density, die area)
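A short worked example may help show how the two equations compose. The Python sketch below plugs in made-up numbers. Every dollar figure and yield is a hypothetical input, and die yield is passed in directly because the slide gives it only as an unspecified function of defect density and die area.

```python
def die_cost(wafer_cost, dies_per_wafer, die_yield):
    """Die cost = wafer cost / (dies per wafer x die yield)."""
    return wafer_cost / (dies_per_wafer * die_yield)


def ic_cost(die, test, packaging, final_test_yield):
    """IC cost = (die + test + packaging cost) / final test yield."""
    return (die + test + packaging) / final_test_yield


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
die = die_cost(wafer_cost=5000.0, dies_per_wafer=200, die_yield=0.5)
total = ic_cost(die, test=5.0, packaging=10.0, final_test_yield=0.8)
print(f"die cost: ${die:.2f}")
print(f"IC cost:  ${total:.2f}")
```

Yield appears in a denominator at both levels, so a drop in either die yield or final test yield raises the cost of every good part.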

Why is power a problem in a μP?
Power used by the μP, vs. system power
Dissipating Heat
– Melting (very bad)
– Packaging (to cool → $)
– Heat leads to poorer performance.
Providing Power
– Battery
– Cost of electricity

Where does the juice go in laptops?
Others have measured ~55% processor increase under max load in laptops [Hsu+Kremer, 2002]

Why worry about power dissipation?
Environment
Thermal issues: affect cooling, packaging, reliability, timing
Battery life

Power-Aware Computing Applications
Energy-Constrained Computing
Temperature/Current-Constrained

More Performance Optimization
Exploiting Instruction Level Parallelism

Limitations of Scalar Pipelines
Upper Bound on Scalar Pipeline Throughput
– Limited by IPC=1 (“Flynn Bottleneck”)
Performance Lost Due to Rigid In-order Pipeline
– Unnecessary stalls

Terms
Instruction parallelism
– Number of instructions being worked on
Operation Latency
– The time (in cycles) until the result of an instruction is available for use as an operand in a subsequent instruction. For example, if the result of an Add instruction can be used as an operand of an instruction that is issued in the cycle after the Add is issued, we say that the Add has an operation latency of one.
Peak IPC
– The maximum sustainable number of instructions that can be executed per clock.
# Performance modeling for computer architects, C. M. Krishna

Architectures for Exploiting Instruction-Level Parallelism

Superscalar Machine

What is the real problem?
CPI of in-order pipelines degrades very sharply if the machine parallelism is increased beyond a certain point,
i.e., when NxM approaches average distance between dependent instructions
Forwarding is no longer effective
Pipeline may never be full due to frequent dependency stalls!

Missed Speedup in In-Order Pipelines
addf f0,f1,f2    F D E+ W
mulf f2,f3,f2    F D d* E* W
subf f0,f1,f4    F p* D E+ W
What’s happening in cycle 4?
– mulf stalls due to RAW hazard
  OK, this is a fundamental problem
– subf stalls due to pipeline hazard
  Why? subf can’t proceed into D because mulf is there
  That is the only reason, and it isn’t a fundamental one
Why can’t subf go into D in cycle 4 and E+ in cycle 5?

The Problem With In-Order Pipelines
[Figure: pipeline diagram with I$, BP, regfile, D$]
In-order pipeline
– Structural hazard: 1 insn register (latch) per stage
  1 instruction per stage per cycle (unless pipeline is replicated)
  Younger instr. can’t “pass” older instr. without “clobbering” it
Out-of-order pipeline
– Implement “passing” functionality by removing structural hazard

New Pipeline Terminology
[Figure: pipeline diagram with I$, BP, regfile, D$]
In-order pipeline
– Often written as F,D,X,W (multi-cycle X includes M)
– Variable latency
  1-cycle integer (including mem)
  3-cycle pipelined FP

ILP: Instruction-Level Parallelism
ILP is a measure of the amount of inter-dependencies between instructions
Average ILP = no. instructions / no. cycles required
code1: ILP = 1, i.e. must execute serially
code2: ILP = 3, i.e. can execute at the same time
code1:
r1 ← r2 + 1
r3 ← r1 / 17
r4 ← r0 - r3
code2:
r1 ← r2 + 1
r3 ← r9 / 17
r4 ← r0 - r10
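The average-ILP figure for code1 and code2 can be reproduced with a short Python sketch. It assumes every instruction takes one cycle, functional units are unlimited, and only true (RAW) register dependencies constrain the schedule. The function name `average_ilp` and the tuple encoding of instructions are illustrative choices, not anything defined in the lecture.

```python
def average_ilp(instructions):
    """Return instructions / cycles for a dependency-limited schedule.

    instructions: list of (dest, [sources]) in program order.
    Assumes unit latency, unlimited hardware, and RAW dependencies only.
    """
    produced_in = {}   # register -> cycle in which its latest value is produced
    total_cycles = 0
    for dest, sources in instructions:
        # Execute one cycle after the latest producer of any source operand.
        # Registers never written in this snippet are ready at cycle 0.
        cycle = 1 + max((produced_in.get(src, 0) for src in sources), default=0)
        produced_in[dest] = cycle
        total_cycles = max(total_cycles, cycle)
    return len(instructions) / total_cycles


code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]

print("code1 ILP:", average_ilp(code1))
print("code2 ILP:", average_ilp(code2))
```

In code1 each instruction waits for the one before it, so three instructions need three cycles. In code2 no instruction reads a register written by another, so all three fit in a single cycle.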

Purported Limits on ILP
Study                        ILP
Weiss and Smith [1984]       1.58
Sohi and Vajapeyam [1987]    1.81
Tjaden and Flynn [1970]      1.86
Tjaden and Flynn [1973]      1.96
Uht [1986]                   2.00
Smith et al. [1989]          2.00
Jouppi and Wall [1988]       2.40
Johnson [1991]               2.50
Acosta et al. [1986]         2.79
Wedig [1982]                 3.00
Butler et al. [1991]         5.8
Melvin and Patt [1991]       6
Wall [1991]                  7
Kuck et al. [1972]           8
Riseman and Foster [1972]    51
Nicolau and Fisher [1984]    90

Scope of ILP Analysis
First block (ILP=1):
r1 ← r2 + 1
r3 ← r1 / 17
r4 ← r0 - r3
Second block (ILP=1):
r11 ← r
r13 ← r11 / 17
r14 ← r13 - r20
Both blocks together: ILP=2
Out-of-order execution exposes more ILP

How Large Must the “Window” Be?

Dynamic Scheduling – OoO Execution
Dynamic scheduling
– Totally in the hardware
– Also called “out-of-order execution” (OoO)
Fetch many instructions into instruction window
– Use branch prediction to speculate past (multiple) branches
– Flush pipeline on branch misprediction
Rename to avoid false dependencies (WAW and WAR)
Execute instructions as soon as possible
– Register dependencies are known
– Handling memory dependencies is more tricky (much more later)
Commit instructions in order
– If anything strange happens before commit, just flush the pipeline
Current machines: 100+ instruction scheduling window

Motivation for Dynamic Scheduling
Dynamic scheduling (out-of-order execution)
– Execute instructions in non-sequential order…
  + Reduce RAW stalls
  + Increase pipeline and functional unit (FU) utilization
    – Original motivation was to increase FP unit utilization
  + Expose more opportunities for parallel issue (ILP)
    – Not in-order → can be in parallel
– …but make it appear like sequential execution
  Important
  – But difficult
Next few lectures

Dynamic Scheduling: The Big Picture
[Figure: pipeline with I$, BP, insn buffer, regfile, and D$. The insn buffer holds add p2,p3,p4; sub p2,p4,p5; mul p2,p5,p6; div p4,4,p7, next to a Ready Table with entries for P2 through P7.]
Instructions fetched/decoded/renamed into Instruction Buffer
– Also called “instruction window” or “instruction scheduler”
Instructions (conceptually) check ready bits every cycle
– Execute when ready
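The "check ready bits every cycle, execute when ready" idea can be sketched in Python using the four renamed instructions from the slide. Several things here are assumptions made for illustration: p2 and p3 start out ready, every instruction has a one-cycle latency, and any number of instructions may issue per cycle. The function name `schedule` and the tuple layout are hypothetical.

```python
def schedule(insn_buffer, ready):
    """Issue instructions as soon as all of their source registers are ready.

    insn_buffer: list of (text, [source regs], dest reg) in program order.
    ready: physical registers whose values are already available.
    Returns one list of instruction texts per cycle.
    """
    ready = set(ready)
    waiting = list(insn_buffer)
    cycles = []
    while waiting:
        # Conceptually, every waiting instruction checks its ready bits.
        issued = [insn for insn in waiting if all(s in ready for s in insn[1])]
        if not issued:
            raise ValueError("no waiting instruction has all sources ready")
        cycles.append([insn[0] for insn in issued])
        # Unit-latency assumption: results wake up dependents next cycle.
        ready.update(insn[2] for insn in issued)
        waiting = [insn for insn in waiting if insn not in issued]
    return cycles


insn_buffer = [
    ("add p2,p3,p4", ["p2", "p3"], "p4"),
    ("sub p2,p4,p5", ["p2", "p4"], "p5"),
    ("mul p2,p5,p6", ["p2", "p5"], "p6"),
    ("div p4,4,p7", ["p4"], "p7"),
]

issue_order = schedule(insn_buffer, ready={"p2", "p3"})
for cycle, insns in enumerate(issue_order, start=1):
    print(f"cycle {cycle}: {insns}")
```

Under these assumptions the div issues in the same cycle as the sub and ahead of the older mul, because it depends only on p4. That is out-of-order execution.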

Going Forward: What’s Next
We’ll build this up in steps over the next few weeks
– Register renaming to eliminate “false” dependencies
– “Tomasulo’s algorithm” to implement OoO execution
– Handling precise state and speculation
– Handling memory dependencies
Let’s get started!

Dependency vs. Hazard
A dependency exists independent of the hardware.
– So if Inst #1’s result is needed for Inst #1000 there is a dependency
– It is only a hazard if the hardware has to deal with it.
  So in our pipelined machine we only worried if there wasn’t a “buffer” of two instructions between the dependent instructions.

True Data Dependencies
True data dependency
– RAW – Read after Write
  R1=R2+R3
  R4=R1+R5
True dependencies prevent reordering
– (Mostly) unavoidable

False Data Dependencies
False or Name dependencies
– WAW – Write after Write
  R1=R2+R3
  R1=R4+R5
– WAR – Write after Read
  R2=R1+R3
  R1=R4+R5
False dependencies prevent reordering
– Can they be eliminated? (Yes, with renaming!)
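The three dependency types can be told apart mechanically by comparing the destination and source registers of an instruction pair. The Python sketch below does that for the examples on these slides. The function name `classify` and the (dest, sources) encoding are illustrative assumptions.

```python
def classify(first, second):
    """Return the register dependencies of `second` on the earlier `first`.

    Each instruction is (dest, [sources]); `first` precedes `second`
    in program order.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    kinds = set()
    if dest1 in srcs2:
        kinds.add("RAW")   # second reads what first wrote (true dependency)
    if dest1 == dest2:
        kinds.add("WAW")   # both write the same name (false dependency)
    if dest2 in srcs1:
        kinds.add("WAR")   # second overwrites what first read (false dependency)
    return kinds


# R1=R2+R3 ; R4=R1+R5
print(classify(("R1", ["R2", "R3"]), ("R4", ["R1", "R5"])))
# R1=R2+R3 ; R1=R4+R5
print(classify(("R1", ["R2", "R3"]), ("R1", ["R4", "R5"])))
# R2=R1+R3 ; R1=R4+R5
print(classify(("R2", ["R1", "R3"]), ("R1", ["R4", "R5"])))
```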

Data Dependency Graph: Simple example
R1=MEM[R2+0] // A
R2=R2+4 // B
R3=R1+R4 // C
MEM[R2+0]=R3 // D
[Figure: dependency graph with edges labeled RAW, WAW, WAR]
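Since the graph itself did not survive in the transcript, the following Python sketch derives the register dependency edges for instructions A through D. It considers register names only and ignores any dependencies through memory, which is a simplifying assumption. The function name `dependency_edges` and the instruction encoding are hypothetical.

```python
def dependency_edges(program):
    """Return (from, to, kind, register) edges for a straight-line program.

    program: list of (label, dest or None, [sources]) in program order.
    Register dependencies only; memory dependencies are ignored.
    """
    last_writer = {}   # register -> label of its most recent writer
    readers = {}       # register -> labels that read it since that write
    edges = []
    for label, dest, sources in program:
        for src in sources:
            if src in last_writer:
                edges.append((last_writer[src], label, "RAW", src))
        for src in sources:
            readers.setdefault(src, []).append(label)
        if dest is not None:
            if dest in last_writer:
                edges.append((last_writer[dest], label, "WAW", dest))
            for reader in readers.get(dest, []):
                if reader != label:
                    edges.append((reader, label, "WAR", dest))
            last_writer[dest] = label
            readers[dest] = []
    return edges


simple = [
    ("A", "R1", ["R2"]),        # R1=MEM[R2+0]
    ("B", "R2", ["R2"]),        # R2=R2+4
    ("C", "R3", ["R1", "R4"]),  # R3=R1+R4
    ("D", None, ["R2", "R3"]),  # MEM[R2+0]=R3
]

edges = dependency_edges(simple)
for edge in edges:
    print(edge)
```

For this example the only false dependency is the WAR from A to B on R2. The other three edges are true RAW dependencies.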

Data Dependency Graph: More complex example
R1=MEM[R3+4] // A
R2=MEM[R3+8] // B
R1=R1*R2 // C
MEM[R3+4]=R1 // D
MEM[R3+8]=R1 // E
R1=MEM[R3+12] // F
R2=MEM[R3+16] // G
R1=R1*R2 // H
MEM[R3+12]=R1 // I
MEM[R3+16]=R1 // J
[Figure: dependency graph with edges labeled RAW, WAW, WAR]

Eliminating False Dependencies
R1=MEM[R3+4] // A
R2=MEM[R3+8] // B
R1=R1*R2 // C
MEM[R3+4]=R1 // D
MEM[R3+8]=R1 // E
R1=MEM[R3+12] // F
R2=MEM[R3+16] // G
R1=R1*R2 // H
MEM[R3+12]=R1 // I
MEM[R3+16]=R1 // J
Well, logically there is no reason for F-J to be dependent on A-E. So…
A B F G
C H
D E I J
– Should be possible.
But that would cause either C or H to have the wrong reg inputs
How do we fix this?
– Remember, the dependency is really on the name of the register
– So… change the register names!

Register Renaming Concept
– The register names are arbitrary
– The register name only needs to be consistent between writes.
R1 = …
… = R1 …
… = … R1
R1 = …
The value in R1 is “alive” from when the value is written until the last read of that value.

So after renaming, what happens to the dependencies?
P1=MEM[R3+4] //A
P2=MEM[R3+8] //B
P3=P1*P2 //C
MEM[R3+4]=P3 //D
MEM[R3+8]=P3 //E
P4=MEM[R3+12] //F
P5=MEM[R3+16] //G
P6=P4*P5 //H
MEM[R3+12]=P6 //I
MEM[R3+16]=P6 //J
[Figure: dependency graph with edges labeled RAW, WAW, WAR]

Register Renaming Approach
Every time an architected register is written we assign it to a physical register
– Until the architected register is written again, we continue to translate it to the physical register number
– Leaves RAW dependencies intact
It is really simple, let’s look at an example:
– Names: r1,r2,r3
– Locations: p1,p2,p3,p4,p5,p6,p7
– Original mapping: r1 → p1, r2 → p2, r3 → p3, p4 – p7 are “free”

MapTable
r1  r2  r3    FreeList       Orig. insns     Renamed insns
p1  p2  p3    p4,p5,p6,p7    add r2,r3,r1    add p2,p3,p4
p4  p2  p3    p5,p6,p7       sub r2,r1,r3    sub p2,p4,p5
p4  p2  p5    p6,p7          mul r2,r3,r3    mul p2,p5,p6
p4  p2  p6    p7             div r1,4,r1     div p4,4,p7
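The table can be reproduced with a few lines of Python that model the map table as a dictionary and the free list as a queue. This is a minimal sketch of the renaming step only. It assumes the slide's operand order (sources first, destination last), treats any operand missing from the map table as an immediate, and never returns registers to the free list. The function name `rename` is illustrative.

```python
def rename(instructions, map_table, free_list):
    """Rename architected registers to physical registers.

    instructions: list of (op, src1, src2, dest), destination last.
    Returns (renamed instruction strings, final map table, remaining free list).
    """
    map_table = dict(map_table)
    free_list = list(free_list)
    renamed = []
    for op, src1, src2, dest in instructions:
        # Translate sources with the mapping in effect before this write,
        # which keeps RAW dependencies intact. Immediates pass through.
        p_src1 = map_table.get(src1, src1)
        p_src2 = map_table.get(src2, src2)
        # Every write gets a fresh physical register from the free list.
        p_dest = free_list.pop(0)
        map_table[dest] = p_dest
        renamed.append(f"{op} {p_src1},{p_src2},{p_dest}")
    return renamed, map_table, free_list


program = [
    ("add", "r2", "r3", "r1"),
    ("sub", "r2", "r1", "r3"),
    ("mul", "r2", "r3", "r3"),
    ("div", "r1", "4", "r1"),
]

renamed, final_map, remaining = rename(
    program,
    map_table={"r1": "p1", "r2": "p2", "r3": "p3"},
    free_list=["p4", "p5", "p6", "p7"],
)
for insn in renamed:
    print(insn)
```

After four writes the free list is empty, so a fifth writing instruction would have nowhere to go. Deciding when a physical register can safely be returned to the free list is the part this sketch leaves out.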

Register Renaming Hardware
Really simple table (rename table)
– Every time an instruction which writes a register is encountered assign it a new physical register number
But there is some complexity
– When do you free physical registers?