CSCE 513 Computer Architecture
Lecture 12: Reorder Buffers
Topics
  Tomasulo’s loop example
  Speculation
  Reorder buffers
Readings:
October 16, 2017
Overview
Last Time
  New references
  Control hazards: Lecture 7, slides 27-32
  Data hazards review
  Tomasulo overview, examples
New
  Tomasulo overview, examples revisited: Figures 2.10 (right one), 2.11
  Tomasulo’s algorithm details: Fig. 2.12
  Tomasulo + Reorder Buffer (ROB): Figs. 2.14, 2.15, 2.16
References
  Chapter 2, Section 2.6
Test 1
The University of Adelaide, School of Computer Science 18 September 2018
Dynamic Scheduling
Branch Prediction
Dynamic scheduling implies:
  Out-of-order execution
  Out-of-order completion
Creates the possibility for WAR and WAW hazards
Tomasulo’s approach:
  Tracks when operands are available
  Introduces register renaming in hardware
  Minimizes WAW and WAR hazards
Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
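To make the three hazard types concrete, here is a minimal Python sketch that classifies the hazards between an earlier and a later instruction by comparing the registers each one reads and writes. The `Instr` tuple and the `hazards` function are hypothetical names introduced for illustration; the instructions are taken from the register renaming example in this lecture.

```python
from typing import FrozenSet, NamedTuple, Set


class Instr(NamedTuple):
    """Hypothetical instruction record: text plus register sets."""
    text: str
    writes: FrozenSet[str]
    reads: FrozenSet[str]


def hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Return the hazard types that arise if `later` is reordered around `earlier`."""
    found = set()
    if earlier.writes & later.reads:
        found.add("RAW")  # true data dependence
    if earlier.reads & later.writes:
        found.add("WAR")  # antidependence
    if earlier.writes & later.writes:
        found.add("WAW")  # output dependence
    return found


add = Instr("ADD.D F6,F0,F8", frozenset({"F6"}), frozenset({"F0", "F8"}))
sub = Instr("SUB.D F8,F10,F14", frozenset({"F8"}), frozenset({"F10", "F14"}))
mul = Instr("MUL.D F6,F10,F8", frozenset({"F6"}), frozenset({"F10", "F8"}))

add_sub = hazards(add, sub)  # ADD.D reads F8, SUB.D writes F8
add_mul = hazards(add, mul)  # both write F6
sub_mul = hazards(sub, mul)  # MUL.D reads the F8 that SUB.D writes
```

Only the RAW case is a true data dependence; the WAR and WAW cases come from reusing register names, which is what renaming removes.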
Register Renaming
Example:
  DIV.D  F0,F2,F4
  ADD.D  F6,F0,F8
  S.D    F6,0(R1)
  SUB.D  F8,F10,F14
  MUL.D  F6,F10,F8
Dependences:
  Antidependence on F8: ADD.D reads F8, SUB.D writes F8
  Antidependence on F6: S.D reads F6, MUL.D writes F6
  Plus a name (output) dependence on F6: ADD.D and MUL.D both write F6
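The effect of renaming on this example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the hardware mechanism: every destination is given a fresh temporary name (`T0`, `T1`, ... are hypothetical names), and each source reads whichever name most recently replaced its register. The `rename` function and the tuple encoding of instructions are also assumptions made for the sketch.

```python
def rename(program):
    """Rename every destination register to a fresh temporary.

    program: list of (op, dest, srcs); dest is None for stores.
    Returns a new list in the same format using renamed registers.
    """
    mapping = {}  # architectural register -> latest temporary name
    counter = 0
    renamed = []
    for op, dest, srcs in program:
        # Sources use the most recent name of each register, if it was renamed.
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = f"T{counter}"
            counter += 1
            mapping[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D", None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
result = rename(program)
for op, dest, srcs in result:
    print(op, dest, srcs)
```

After renaming, ADD.D still reads the original F8 while SUB.D writes a new temporary, and ADD.D and MUL.D write different temporaries, so the WAR and WAW hazards disappear. The RAW dependences (for example, MUL.D reading the result of SUB.D) remain, as they must.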
Figure 2.9 Tomasulo CDB Register Renaming
Tomasulo’s Algorithm
Three steps:
Issue
  Get the next instruction from the FIFO queue
  If a reservation station (RS) is available, issue the instruction to the RS, with operand values if available
  If operand values are not available, the instruction waits in the RS until they arrive
Execute
  When an operand becomes available, store it in any reservation stations waiting for it
  When all operands are ready, begin executing the instruction
  Loads and stores are maintained in program order through the effective address calculation
  No instruction is allowed to initiate execution until all branches that precede it in program order have completed
Write result
  Write the result on the CDB into reservation stations and store buffers
  (Stores must wait until both address and value are received)
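The issue and write-result steps can be sketched as a small Python model. This is a simplified illustration, not the full algorithm: the class names `RS` and `Machine` are hypothetical, any free station accepts any operation, execution latency is not modeled, and loads, stores, and branches are left out. The fields `vj`/`vk` (operand values) and `qj`/`qk` (producing station) follow the Vj/Qj, Vk/Qk columns used in the figures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RS:
    """One reservation station."""
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: Optional[str] = None  # station that will produce operand j
    qk: Optional[str] = None  # station that will produce operand k

    def ready(self) -> bool:
        return self.busy and self.qj is None and self.qk is None


class Machine:
    def __init__(self, station_names, regs):
        self.stations = [RS(n) for n in station_names]
        self.regs = dict(regs)  # architectural register values
        self.reg_status = {}    # register -> station that will write it

    def issue(self, op, dest, src1, src2):
        rs = next((s for s in self.stations if not s.busy), None)
        if rs is None:
            return None  # no free station: structural stall
        rs.busy, rs.op = True, op
        # Read sources before recording dest, so dest == src works.
        rs.qj = self.reg_status.get(src1)
        rs.vj = self.regs[src1] if rs.qj is None else None
        rs.qk = self.reg_status.get(src2)
        rs.vk = self.regs[src2] if rs.qk is None else None
        self.reg_status[dest] = rs.name
        return rs

    def write_result(self, rs, value):
        # Broadcast on the CDB: every waiting station captures the value.
        for s in self.stations:
            if s.qj == rs.name:
                s.vj, s.qj = value, None
            if s.qk == rs.name:
                s.vk, s.qk = value, None
        for reg, producer in list(self.reg_status.items()):
            if producer == rs.name:
                self.regs[reg] = value
                del self.reg_status[reg]
        rs.busy, rs.op, rs.vj, rs.vk = False, None, None, None


m = Machine(["Mult1", "Add1"],
            {"F0": 0.0, "F2": 6.0, "F4": 3.0, "F6": 0.0, "F8": 1.0})
div = m.issue("DIV.D", "F0", "F2", "F4")  # both operands available
add = m.issue("ADD.D", "F6", "F0", "F8")  # waits on Mult1 for F0
add_ready_before = add.ready()
m.write_result(div, 2.0)                  # 6.0 / 3.0 broadcast on the CDB
add_ready_after = add.ready()
```

Here ADD.D issues even though F0 is not yet available; it records `Mult1` in `qj` and becomes ready only after the CDB broadcast.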
Example (new and improved in 5th edition)
Figure 3.8
Data-Flow graph
Figure 3.9.a Tomasulo Issue
Figure 3.9.b Tomasulo Execute
Figure 3.9.c Tomasulo Write Result
Tomasulo Loop Example
Loop: L.D    F0, 0(R1)
      MUL.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, -8
      BNE    R1, R2, Loop
Dynamic loop unrolling of floating-point and load (L.D) operations
Observations on Tomasulo’s Algorithm
  Designed by Tomasulo for the IBM 360/91: http://www.columbia.edu/acis/history/36091.html
  Does not require the compiler to do all of the work
  Changes to hardware (e.g., adding another multiplier) do not require changes to the compiler
  Designed before caches, but out-of-order execution (OoOE) really helps with cache misses
  Dynamic scheduling is required for “speculation”
Figure 3.12 Tomasulo + ROB example
Figure 3.10 - Two active Iterations of loop
Reorder Buffers
Speculation
Four steps:
  Issue
  Execute
  Write result
  Commit
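The role of the commit step can be illustrated with a minimal reorder buffer sketch in Python. The class and method names (`ROBEntry`, `ReorderBuffer`, `flush`) are hypothetical, and the model makes simplifying assumptions: results are written into ROB entries in any order, registers are updated only at commit, commit retires every ready entry at the head in one call, and a misprediction is handled by discarding all entries.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ROBEntry:
    instr: str
    dest: Optional[str]  # None for instructions with no register result
    value: Optional[float] = None
    ready: bool = False


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()  # head of the queue is the oldest instruction

    def issue(self, instr, dest):
        if len(self.entries) >= self.size:
            return None  # ROB full: stall issue
        entry = ROBEntry(instr, dest)
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value):
        # Results may arrive out of order; they wait here until commit.
        entry.value, entry.ready = value, True

    def commit(self, regs):
        """Retire ready entries from the head, in program order."""
        retired = []
        while self.entries and self.entries[0].ready:
            entry = self.entries.popleft()
            if entry.dest is not None:
                regs[entry.dest] = entry.value
            retired.append(entry.instr)
        return retired

    def flush(self):
        # Assumed misprediction handling: discard all speculative work.
        self.entries.clear()


regs = {"F0": 0.0, "F6": 0.0}
rob = ReorderBuffer(size=4)
mul = rob.issue("MUL.D F0,F2,F4", "F0")
add = rob.issue("ADD.D F6,F8,F10", "F6")

rob.write_result(add, 3.0)   # the younger instruction finishes first
first = rob.commit(regs)     # nothing retires: MUL.D at the head is not ready
f6_before = regs["F6"]

rob.write_result(mul, 12.0)
second = rob.commit(regs)    # both retire, in program order
```

ADD.D completes out of order, but its result does not reach the register file until the older MUL.D has committed. That separation is what lets speculative results be discarded cleanly.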
Koren’s Tools Again http://www.ecs.umass.edu/ece/koren/architecture/
Figure 2.15 Tomasulo + ROB example
Figure 2.16 Tomasulo + ROB example
Fig 2.17a Tomasulo+ROB Details
Fig 2.17b Tomasulo+ROB Execute
Fig 2.17c Tomasulo+ROB Write-result
Fig 2.17d Tomasulo+ROB Commit
Figure 2.18 Multiple Issue Approaches
Unrolling for VLIW
For i = 1, 10000: x[i] = x[i] + c
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, -8
      BNE    R1, R2, Loop
Registers for:
  Load   Sum
  F0     F4
  F6     F8
  F10    F12
  F14    F16
  F18    F20
  F22    F24
  F26    F28
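As a sketch of what the unrolled loop body looks like before it is packed into VLIW slots, the following Python generates one unrolled iteration from a list of (load, sum) register pairs. It reads the register list on this slide as alternating Load/Sum pairs, which is an assumption about the flattened table, and the `unroll` function is a hypothetical helper. It also assumes 8-byte elements and that the trip count is a multiple of the unroll factor.

```python
def unroll(pairs, stride=8):
    """Emit an unrolled loop body for x[i] = x[i] + c.

    pairs: list of (load_reg, sum_reg), one pair per unrolled copy.
    F2 holds the constant c; R1 walks down the array, as in the original loop.
    """
    lines = ["Loop:"]
    # All loads first, then all adds, then all stores.
    for k, (load_reg, _) in enumerate(pairs):
        lines.append(f"L.D {load_reg}, {-stride * k}(R1)")
    for load_reg, sum_reg in pairs:
        lines.append(f"ADD.D {sum_reg}, {load_reg}, F2")
    for k, (_, sum_reg) in enumerate(pairs):
        lines.append(f"S.D {sum_reg}, {-stride * k}(R1)")
    # One pointer update and one branch per unrolled iteration.
    lines.append(f"DADDUI R1, R1, {-stride * len(pairs)}")
    lines.append("BNE R1, R2, Loop")
    return lines


pairs = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16"),
         ("F18", "F20"), ("F22", "F24"), ("F26", "F28")]
body = unroll(pairs)
print("\n".join(body))
```

Giving each copy its own load and sum registers is what removes the name dependences between copies, so the loads, adds, and stores from different copies can be scheduled into the same wide instruction.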
Figure 2.19 VLIW
Advanced Techniques for Instruction Delivery and Speculation
  Increasing instruction fetch bandwidth
    Branch target buffers
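A branch target buffer can be sketched as a small table indexed by the fetch address. The Python below is a minimal direct-mapped illustration under assumptions not stated on the slide: instructions are 4 bytes, only taken branches are stored, and an entry is deleted when its branch turns out not taken. The class name, table size, and indexing function are all hypothetical.

```python
class BranchTargetBuffer:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries  # each slot: (branch_pc, target_pc)

    def _index(self, pc):
        # Assumes word-aligned 4-byte instructions.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Return the predicted next fetch address for the instruction at pc."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]  # hit: predict taken, fetch from stored target
        return pc + 4       # miss: fall through to the next instruction

    def update(self, pc, taken, target):
        """Record the actual branch outcome once it is known."""
        i = self._index(pc)
        if taken:
            self.table[i] = (pc, target)
        elif self.table[i] is not None and self.table[i][0] == pc:
            self.table[i] = None  # branch no longer predicted taken


btb = BranchTargetBuffer()
miss = btb.predict(0x40)           # not in the table yet
btb.update(0x40, True, 0x20)       # branch at 0x40 was taken to 0x20
hit = btb.predict(0x40)
btb.update(0x40, False, 0x20)      # later the branch falls through
after_removal = btb.predict(0x40)
```

The point of the structure is that the predicted target is available during fetch, before the instruction has even been decoded as a branch.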
When is the Branch Target Address available? Fig ? Appendix A
Figure A.24 – getting the branch target quicker
When is the Branch Target Address available?
Pentium 4 (sec 2.10)
  Front end: decoder translates IA32 instructions into micro-ops (uops), which are RISC-like
  3 IA32 instructions can be decoded per cycle, yielding up to 6 uops
  Uops are executed using an out-of-order speculative pipeline (using register renaming instead of a ROB)
  Pentium 3 required at least 11 cycles for an instruction to go from fetch to “retire”
  Pentium 4 pipeline depth continued to increase:
    21 cycles, allowing 1.5 GHz
    31 cycles, allowing 3.2 GHz
Figure 2-26 Pentium 4 (Prescott)
Figure 2-27 Pentium 4 (Prescott)
Tomasulo + Re-Order Buffer (ROB)
http://www.ecs.umass.edu/ece/koren/architecture
Configuration: defaults except:
  F0 F15 in operands for the loads
  FU latencies: FP-Adder: 2, FP-Multiplier: 6, FP-Divider: 12
  Load latency: 2
Start simulation, then Clock+1 to step through
ROB Example (pp 108)
[Figure: reorder buffer (columns: Dest, Instr, Value, Ready); load path from memory and store path to memory (Dest, Addr); reservation stations for the FP adder and FP multiplier (Dest, Op, Vj/Qj, Vk/Qk); register status for F0 through F14 (Reg, ROB, Busy)]
Tomasulo Example Page 98
Tomasulo’s Example (pp 98)
Instruction status (columns: Instruction, Issue, Execute, Write Result):
  L.D    F6, 32(R2)
  MUL.D  F0, F2, F4
Cycle:
[Figure: memory buffers (Dest, Addr); reservation stations for the FP adder and FP multiplier (Busy, Op, Vj/Qj, Vk/Qk); register status for F0 through F20 (Reg, Qj)]
Power Wall
  A ~125 W CPU is near the limit for air cooling
  Water cooled: http://www-03.ibm.com/press/us/en/pressrelease/32049.wss