Lecture 12 Reorder Buffers

Presentation transcript:

Lecture 12 Reorder Buffers
CSCE 513 Computer Architecture
Topics
  Tomasulo’s Loop example
  Speculation
  Reorder Buffers
Readings:
October 16, 2017

Overview
Last Time
  Control Hazards: Lecture 7 slides 27-32
  Data Hazards Review
  Tomasulo Overview, examples
New
  Tomasulo Overview, examples revisited
  Figures 2.10 right one, 2.11
  Tomasulo’s Algorithm details fig 2.12
  Tomasulo + ReOrder Buffer (ROB) fig 2.14, 2.15, 2.16
References
  Chapter 2 section 2.6
Test 1

The University of Adelaide, School of Computer Science, 18 September 2018
Dynamic Scheduling (section: Branch Prediction)
Dynamic scheduling implies:
  Out-of-order execution
  Out-of-order completion
  Creates the possibility for WAR and WAW hazards
Tomasulo’s Approach:
  Tracks when operands are available
  Introduces register renaming in hardware
  Minimizes WAW and WAR hazards
Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer

Register Renaming
Example:
  DIV.D F0,F2,F4
  ADD.D F6,F0,F8
  S.D   F6,0(R1)
  SUB.D F8,F10,F14
  MUL.D F6,F10,F8
Name dependence (output dependence) on F6 between ADD.D and MUL.D
Antidependence on F8 between ADD.D and SUB.D
Antidependence on F6 between S.D and MUL.D
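The renaming idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple encoding of instructions, the helper names `name_hazards` and `rename`, and the fresh names P0, P1, ... are all hypothetical, and the hazard check is a naive pairwise scan that is adequate for this five-instruction example only.

```python
from itertools import count

# Each instruction: (opcode, destination register or None, source registers).
# The store writes memory, so it has no register destination.
PROGRAM = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D",   None, ["F6", "R1"]),
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]

def name_hazards(program):
    """Return (kind, earlier index, later index, register) for WAR and WAW pairs."""
    found = []
    for i, (_, dst_i, srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            dst_j = program[j][1]
            if dst_j is None:
                continue
            if dst_j in srcs_i:      # later write to a register read earlier
                found.append(("WAR", i, j, dst_j))
            if dst_j == dst_i:       # two writes to the same register
                found.append(("WAW", i, j, dst_j))
    return found

def rename(program):
    """Give every destination a fresh name; sources read the latest mapping."""
    fresh = (f"P{n}" for n in count())
    mapping = {}                     # architectural name -> current fresh name
    renamed = []
    for op, dst, srcs in program:
        new_srcs = [mapping.get(s, s) for s in srcs]
        new_dst = None
        if dst is not None:
            new_dst = next(fresh)
            mapping[dst] = new_dst
        renamed.append((op, new_dst, new_srcs))
    return renamed

before = name_hazards(PROGRAM)           # one WAW on F6, WARs on F8 and F6
after = name_hazards(rename(PROGRAM))    # empty: only true dependences remain
```

After renaming, MUL.D still reads the result of SUB.D (a true dependence), but it no longer shares a destination name with ADD.D or overwrites a register that S.D reads.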

Figure 2.9 Tomasulo CDB Register Renaming

Tomasulo’s Algorithm
Three Steps:
Issue
  Get next instruction from FIFO queue
  If a reservation station (RS) is available, issue the instruction to the RS with operand values if available
  If operand values are not available, stall the instruction
Execute
  When an operand becomes available, store it in any reservation stations waiting for it
  When all operands are ready, issue the instruction for execution
  Loads and stores are maintained in program order through effective address
  No instruction is allowed to initiate execution until all branches that precede it in program order have completed
Write result
  Write result on CDB into reservation stations and store buffers
  (Stores must wait until address and value are received)
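The three steps can be sketched as a small functional model in Python. This is a simplified illustration, not the textbook's implementation: the class name `Tomasulo`, its methods, the tag names RS0, RS1, ..., and the register values are assumptions made for the example, and latencies, load/store buffers, and branches are left out entirely.

```python
import operator

OPS = {"ADD": operator.add, "SUB": operator.sub,
       "MUL": operator.mul, "DIV": operator.truediv}

class Tomasulo:
    def __init__(self, regs, num_stations=3):
        self.regs = dict(regs)      # register file values
        self.status = {}            # register -> tag of the station that will write it
        self.stations = {}          # tag -> reservation station entry
        self.num_stations = num_stations
        self.next_tag = 0

    def issue(self, op, dst, src1, src2):
        """Issue: returns False (a stall) when no reservation station is free."""
        if len(self.stations) >= self.num_stations:
            return False
        tag = f"RS{self.next_tag}"
        self.next_tag += 1
        entry = {"op": op, "dst": dst, "V": [None, None], "Q": [None, None]}
        for k, src in enumerate((src1, src2)):
            if src in self.status:
                entry["Q"][k] = self.status[src]   # wait for the producer's tag
            else:
                entry["V"][k] = self.regs[src]     # operand value is available now
        self.stations[tag] = entry
        self.status[dst] = tag                     # later readers of dst use this tag
        return True

    def step(self):
        """Execute one ready station and write its result on the CDB."""
        for tag, entry in self.stations.items():
            if entry["Q"] == [None, None]:         # all operands ready
                value = OPS[entry["op"]](*entry["V"])
                break
        else:
            return None                            # nothing can execute
        del self.stations[tag]
        for waiting in self.stations.values():     # CDB broadcast to waiting stations
            for k in range(2):
                if waiting["Q"][k] == tag:
                    waiting["Q"][k] = None
                    waiting["V"][k] = value
        if self.status.get(entry["dst"]) == tag:   # only the latest writer updates the register
            self.regs[entry["dst"]] = value
            del self.status[entry["dst"]]
        return tag

cpu = Tomasulo({"F0": 0.0, "F2": 8.0, "F4": 2.0, "F6": 0.0,
                "F8": 1.0, "F10": 5.0, "F14": 3.0}, num_stations=4)
cpu.issue("DIV", "F0", "F2", "F4")
cpu.issue("ADD", "F6", "F0", "F8")    # waits on the DIV tag, captures the old F8
cpu.issue("SUB", "F8", "F10", "F14")
cpu.issue("MUL", "F6", "F10", "F8")   # waits on the SUB tag
while cpu.step() is not None:
    pass
# cpu.regs now holds F0 = 4.0, F8 = 2.0, F6 = 10.0
```

Because ADD captures the old value of F8 at issue, SUB can overwrite F8 without a WAR hazard, and because the register status table points F6 at the MUL station, the ADD result never reaches the register file, which removes the WAW hazard.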

Example (new and improved in 5th edition)

Figure 3.8

Data-Flow graph

Figure 3.9.a Tomasulo Issue

Figure 3.9.b Tomasulo Execute

Figure 3.9.c Tomasulo Write Result

Tomasulo Loop Example
Loop: L.D    F0, 0(R1)
      MUL.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, -8
      BNE    R1, R2, Loop
Dynamic loop unrolling of floating-point and load operations

Observations on Tomasulo’s Alg
  Tomasulo designed for the IBM 360/91: http://www.columbia.edu/acis/history/36091.html
  Does not require compiler to do all of the work
  Changes to hardware (e.g., adding another multiplier) do not require changes to compiler
  Designed before caches, but OoOE really helps with cache misses
  Dynamic scheduling required for “speculation”

Figure 3.12 Tomasulo + ROB example

Figure 3.10 - Two active Iterations of loop

Reorder Buffers

Speculation
Steps: Issue, Execute, Write result, Commit
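The role of the added Commit step can be sketched in Python. This is a minimal model under stated assumptions: the class `ReorderBuffer`, its method names, and the entry fields are hypothetical, reservation stations are omitted, and a mispredicted branch is modeled simply as a flag set when the branch writes its result.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self, regs):
        self.regs = dict(regs)    # architectural state, updated only at commit
        self.entries = deque()    # program order: the head is the oldest instruction
        self.next_id = 0

    def issue(self, kind, dest=None):
        """Issue: allocate an entry at the tail and return its ROB id."""
        entry = {"id": self.next_id, "kind": kind, "dest": dest,
                 "value": None, "ready": False, "mispredicted": False}
        self.next_id += 1
        self.entries.append(entry)
        return entry["id"]

    def write_result(self, rob_id, value=None, mispredicted=False):
        """Write result: may happen in any order; it only marks the entry ready."""
        for entry in self.entries:
            if entry["id"] == rob_id:
                entry.update(value=value, ready=True, mispredicted=mispredicted)
                return
        raise KeyError(rob_id)

    def commit(self):
        """Commit ready entries from the head, in program order."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            entry = self.entries.popleft()
            committed.append(entry["id"])
            if entry["kind"] == "branch" and entry["mispredicted"]:
                self.entries.clear()   # squash everything issued after the branch
                break
            if entry["dest"] is not None:
                self.regs[entry["dest"]] = entry["value"]
        return committed

rob = ReorderBuffer({"F0": 0.0, "F4": 0.0})
a = rob.issue("alu", "F0")
b = rob.issue("branch")
c = rob.issue("alu", "F4")             # speculative: issued past the branch
rob.write_result(c, 9.0)               # finishes first, but cannot commit yet
first = rob.commit()                   # nothing commits: the head is not ready
rob.write_result(a, 1.5)
second = rob.commit()                  # commits only the oldest instruction
rob.write_result(b, mispredicted=True)
third = rob.commit()                   # branch commits, younger entry is squashed
```

The speculative result for F4 sits in the buffer but never reaches the register file, which is the point of separating Write result from Commit.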

Koren’s Tools Again http://www.ecs.umass.edu/ece/koren/architecture/

Figure 2.15 Tomasulo + ROB example

Figure 2.16 Tomasulo + ROB example

Fig 2.17a Tomasulo+ROB Details

Fig 2.17b Tomasulo+ROB Execute

Fig 2.17c Tomasulo+ROB Write-result

Fig 2.17d Tomasulo+ROB Commit

Figure 2.18 Multiple Issue Approaches

Unrolling for VLIW
For i=1,10000
  x[i] = x[i] + c

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, -8
      BNE    R1, R2, Loop

Registers for
  Load   Sum
  F0     F4
  F6     F8
  F10    F12
  F14    F16
  F18    F20
  F22    F24
  F26    F28
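As an illustration of what the unrolled loop body looks like before it is scheduled into VLIW slots, the following Python sketch emits the instruction text. The function name `unroll` and the `elem_bytes` parameter are hypothetical, the load/sum register pairing is read off the register list on the slide, and the sketch assumes the iteration count is a multiple of the unroll factor (otherwise a cleanup loop would be needed).

```python
# Register pairs taken from the slide: one load register and one sum register per copy.
LOAD_REGS = ["F0", "F6", "F10", "F14", "F18", "F22", "F26"]
SUM_REGS = ["F4", "F8", "F12", "F16", "F20", "F24", "F28"]

def unroll(factor, elem_bytes=8):
    """Return the loop body unrolled `factor` times as a list of instruction strings."""
    if not 1 <= factor <= len(LOAD_REGS):
        raise ValueError("factor must be between 1 and 7")
    # Copy k works on the element k positions below the one R1 points at.
    offsets = [-elem_bytes * k for k in range(factor)]
    loads = [f"L.D {LOAD_REGS[k]}, {offsets[k]}(R1)" for k in range(factor)]
    adds = [f"ADD.D {SUM_REGS[k]}, {LOAD_REGS[k]}, F2" for k in range(factor)]
    stores = [f"S.D {SUM_REGS[k]}, {offsets[k]}(R1)" for k in range(factor)]
    # A single pointer update and branch replace `factor` copies of the loop overhead.
    tail = [f"DADDUI R1, R1, {-elem_bytes * factor}", "BNE R1, R2, Loop"]
    return loads + adds + stores + tail

body = unroll(7)   # 7 loads, 7 adds, 7 stores, then one DADDUI and one BNE
```

Giving each copy its own register pair is what makes the seven load/add/store chains independent, so a VLIW scheduler is free to pack operations from different copies into the same instruction word.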

Figure 2.19 VLIW

Advanced Techniques for Instruction Delivery and Speculation
  Increasing Instruction Fetch Bandwidth
  Branch Target Buffers

When is the Branch Target Address available? Fig ? Appendix A

Figure A.24 – getting the branch target quicker

When is the Branch Target Address available?

Pentium 4 (sec 2.10)
  Front-end decoder translates IA-32 instructions into micro-ops (uops), which are RISC-like
  3 IA-32 instructions can be decoded per cycle, up to 6 uops
  Uops are executed using an out-of-order speculative pipeline (using register renaming instead of ROB)
  Pentium 3 required at least 11 cycles for an instruction to go from fetch to “retire”
  Pentium 4 pipeline depth continued to increase:
    21 cycles allowing 1.5 GHz
    31 cycles allowing 3.2 GHz

Figure 2-26 Pentium 4 (Prescott)

Figure 2-27 Pentium 4 (Prescott)

Tomasulo + Re-Order Buffer (ROB)
http://www.ecs.umass.edu/ece/koren/architecture
Configuration: defaults except:
  F0  F15 in operands for the loads
  FU latencies:
    FP-Adder: 2
    FP-Multiplier: 6
    FP-Divider: 12
    Load latency: 2
Start simulation, then Clock+1 to step through

ROB Example pp 108 (figure: reorder buffer with columns Dest, Instr, Value, Ready; load buffer from memory with Dest; store path to memory with Dest, Addr; reservation stations for the FP adder and FP multiplier with columns Dest, Op, Vj/Qj, Vk/Qk; register status for F0 through F14 with ROB entry and Busy)

Tomasulo Example Page 98

Tomasulo’s Example pp 98 (figure: instruction status table with columns Instruction, Issue, Execute, Write Result for L.D F6, 32(R2) and MUL.D F0, F2, F4; cycle counter; memory buffers with Dest, Addr; reservation stations for the FP adder and FP multiplier with columns Busy, Op, Vj/Qj, Vk/Qk; register status Qj for F0 through F20)

Power Wall
  ~125W CPU near limit for “air cooled”
  Water cooled: http://www-03.ibm.com/press/us/en/pressrelease/32049.wss