Advanced Computer Architecture – Dynamic Instruction-Level Parallelism

Going Faster – Dynamic Methods: Dynamic Scheduling, Branch Prediction, Multiple Issue, Speculation (split across a first and a second class)

Go Faster – Execute Multiple Instructions Five instructions execute at once

Limits to Performance – why instructions aren't parallel: structural hazards (insufficient resources); data hazards (data values, names); control hazards.

Data Hazards – RAW, WAR, WAW. Example: LD R2,0(R7); ADD R1,R2,R3; DIV R6,R1,R2; SUB R2,R5,R6; OR R6,R5,R4. RAW: ADD and DIV read the R2 written by LD, and SUB reads the R6 written by DIV. WAR: SUB writes the R2 that ADD and DIV read; OR writes the R6 that SUB reads. WAW: SUB writes R2, which LD also writes; OR writes R6, which DIV also writes.
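
The same classification can be written down directly. Below is a minimal C sketch (the instruction encoding and register numbering are assumptions made purely for illustration) that walks the five-instruction sequence above and reports every RAW, WAR, and WAW pair.

```c
#include <stdio.h>

/* Minimal instruction model: one destination and two source registers
   (0 = no register).  These fields are assumptions for the sketch. */
typedef struct {
    const char *text;
    int dst, src1, src2;
} Inst;

/* Report dependences of 'later' on 'earlier' (program order: earlier first). */
static void classify(const Inst *earlier, const Inst *later) {
    if (earlier->dst &&
        (later->src1 == earlier->dst || later->src2 == earlier->dst))
        printf("RAW  %-16s -> %s\n", earlier->text, later->text);  /* true dependence   */
    if (later->dst &&
        (earlier->src1 == later->dst || earlier->src2 == later->dst))
        printf("WAR  %-16s -> %s\n", earlier->text, later->text);  /* anti-dependence   */
    if (earlier->dst && later->dst == earlier->dst)
        printf("WAW  %-16s -> %s\n", earlier->text, later->text);  /* output dependence */
}

int main(void) {
    const Inst code[] = {
        {"LD  R2, 0(R7)",  2, 7, 0},
        {"ADD R1, R2, R3", 1, 2, 3},
        {"DIV R6, R1, R2", 6, 1, 2},
        {"SUB R2, R5, R6", 2, 5, 6},
        {"OR  R6, R5, R4", 6, 5, 4},
    };
    int n = (int)(sizeof code / sizeof code[0]);
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            classify(&code[i], &code[j]);
    return 0;
}
```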

Pipeline blocking: LD R2,0(R7); ADD R1,R2,R3; DIV R6,R1,R2; SUB R8,R5,R4; SW R8,4(R7). ADD must wait for the load and DIV must wait for ADD, but instead of stalling the whole pipeline we could let the independent SUB and SW execute.

Issues with the simple pipeline: in-order issue, execute, and completion. This is a problem for machines with long memory latency, slow instructions, and a limited number of registers. Can we get around these issues?

Dynamic Scheduling – Tomasulo's algorithm. The IBM 360/91 had limits: no cache, slow FP operations, and only 4 double-precision FP registers. Tomasulo's algorithm changes the pipeline's assumptions: out-of-order completion, supported by reservation stations.

Processor characteristics: multiple function units, so multiply/divide won't block add/sub; long pipelines, to hide the latency of long operations; reservation stations, which let an operation start only when its data is ready and effectively add extra registers.

Observations: some code sequences are independent – the order of execution may not affect the result. Registers have two uses: fast access to program variables, and temporaries. A register's value is not needed once it is dead.

Live vs. Dead: LD R2,0(R7); ADD R1,R2,R3; DIV R6,R1,R2; SUB R2,R5,R6; OR R6,R5,R4. R2 is live from the LD until its last read in DIV, and dead after that (SUB overwrites it); R6 is live from the DIV until SUB reads it, and dead after that (OR overwrites it).

Use temp registers: LD S,0(R7); ADD R1,S,R3; DIV T,R1,S; SUB R2,R5,T; OR R6,R5,R4. Renaming the first definitions of R2 and R6 to temporaries S and T removes the WAR and WAW hazards, leaving only the true (RAW) dependences.

Implementation: separate the pipeline into phases – instruction issue, execute, write result. Queue instructions at the execution units until they are ready to run, using temporary registers; if a data value is not yet available, hold a pointer to its producer instead. Issue in program order within function units. A single writeback bus, the Common Data Bus (CDB) – analogous to bypass paths, with local detection and update at each station.

Reservation Stations. Each reservation station has: Op – the operation; Qj, Qk – the reservation stations that will produce source operands j and k; Vj, Vk – the values of source operands j and k; A – the memory address for loads/stores; Busy – indicates the station is occupied. Each architectural register has: Qi – the reservation station that will write this register.
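
A minimal C sketch of this bookkeeping (the sizes, type names, and function signatures below are assumptions, not the 360/91's actual design). It shows the two steps these fields exist to support: issue-time renaming through Qi/Qj/Qk, and the CDB broadcast that fills in waiting operands.

```c
#include <stdbool.h>

#define NUM_RS   8          /* number of reservation stations (assumed) */
#define NUM_REGS 32         /* number of architectural FP registers (assumed) */

typedef enum { OP_NONE, OP_LOAD, OP_ADD, OP_SUB, OP_MUL, OP_DIV } Op;

typedef struct {
    bool   busy;            /* station occupied */
    Op     op;              /* operation to perform */
    int    Qj, Qk;          /* producing station (1-based), 0 = value already in Vj/Vk */
    double Vj, Vk;          /* operand values, valid when Qj/Qk == 0 */
    long   A;               /* effective address for loads/stores */
} RS;

typedef struct {
    int    Qi;              /* station that will write this register, 0 = none */
    double value;
} Reg;

static RS  rs[NUM_RS + 1];  /* index 0 is unused so that tag 0 can mean "ready" */
static Reg regs[NUM_REGS];

/* Issue: rename the source registers to station tags, then claim the
   destination register.  Station r is assumed to be free. */
void issue(int r, Op op, int rd, int rs1, int rs2) {
    rs[r].busy = true;
    rs[r].op   = op;

    if (regs[rs1].Qi) rs[r].Qj = regs[rs1].Qi;           /* wait for the producer */
    else { rs[r].Vj = regs[rs1].value; rs[r].Qj = 0; }   /* value already available */

    if (regs[rs2].Qi) rs[r].Qk = regs[rs2].Qi;
    else { rs[r].Vk = regs[rs2].value; rs[r].Qk = 0; }

    regs[rd].Qi = r;        /* later readers of rd now wait on station r (renaming) */
}

/* Write result: station r broadcasts its result on the CDB; every waiting
   station and the register file compare tags and capture the value. */
void write_result(int r, double result) {
    for (int i = 1; i <= NUM_RS; i++) {
        if (rs[i].Qj == r) { rs[i].Vj = result; rs[i].Qj = 0; }
        if (rs[i].Qk == r) { rs[i].Vk = result; rs[i].Qk = 0; }
    }
    for (int i = 0; i < NUM_REGS; i++)
        if (regs[i].Qi == r) { regs[i].value = result; regs[i].Qi = 0; }
    rs[r].busy = false;
}
```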

Tomasulo FP Unit

Example: L.D F6,34(R2); L.D F2,45(R3); MUL.D F0,F2,F4; SUB.D F8,F2,F6; DIV.D F10,F0,F6; ADD.D F6,F8,F2

Execution (snapshot 1) – Each snapshot shows three tables: the instruction status (Issue, Execute, Write result), the reservation stations (Busy, Op, Vj, Vk, Qj, Qk, A) for Load1, Load2, Add1–Add3, Mult1, Mult2, and the register result status (Qi). Initially all stations are empty and no register is waiting on a result.

Execution (snapshot 2) – L.D F6,34(R2) issues: Load1 becomes busy with address 34(R2), and F6's Qi is set to Load1.

Execution (snapshot 3) – L.D F2,45(R3) issues: Load2 becomes busy with address 45(R3), and F2's Qi is set to Load2 (F6 still waits on Load1).

Execution (snapshot 4) – MUL.D F0,F2,F4 issues to Mult1 with Vk = Regs[F4] and Qj = Load2 (waiting for F2); F0's Qi is set to Mult1.

Execution (snapshot 5) – SUB.D F8,F2,F6 issues to Add1, waiting on Load2 and Load1 for its operands; F8's Qi is set to Add1.

Execution (snapshot 6) – DIV.D F10,F0,F6 issues to Mult2, waiting on Mult1 (for F0) and Load1 (for F6); F10's Qi is set to Mult2.

Execution (snapshot 7) – ADD.D F6,F8,F2 issues to Add2, waiting on Add1 and Load2; F6's Qi is updated to Add2, which renames away the WAR hazard with DIV.D. Load1 has written its result on the CDB, so the stations waiting on it now hold the loaded value Mem[34+Regs[R2]] and Load1 is free.

Execution (snapshot 8) – Load2 writes its result: F2's Qi clears, and Add1 and Mult1 now hold both operand values, so SUB.D and MUL.D can execute.

Execution (snapshot 9) – SUB.D completes and broadcasts F8 on the CDB; Add1 frees, and Add2 (ADD.D) captures the value it was waiting for.

Execution (snapshot 10) – MUL.D completes and writes F0; F0's Qi clears and Mult2 (DIV.D) captures the value, so DIV.D can begin executing.

Execution (snapshot 11) – ADD.D executes in Add2 with its captured operand values; DIV.D is still executing in Mult2.

Execution (snapshot 12) – ADD.D completes and writes F6; although it was the last instruction issued, it finishes before DIV.D (out-of-order completion), which is safe because DIV.D captured its F6 value at issue time.

Execution (snapshot 13) – DIV.D writes F10: all instructions have completed, every reservation station is empty, and no register's Qi field points to a station.

Issues for Tomasulo. Hardware complexity: only 1 Common Data Bus, complex control logic, and an associative lookup at each reservation station. Imprecise interrupts: instructions issue in order, but do not necessarily complete in order. Control flow: branches cause stalls.

Cost of branches. Pipeline flushes: instructions on the wrong path. Redirected instruction fetch: the target address was not prefetched. Frequency: branches are 20%-25% of instructions executed. Can we reduce branch penalties?

Loops: for(i=0;i<100;i++) { … } – the loop branch is usually taken (99 times out of 100 here).

Branch prediction. Static prediction: e.g., always predict not taken, or use a software hint. Or add state: 1 bit that records which direction the branch went the last time it executed. Note: the 1-bit scheme is wrong twice per loop – once when the loop exits and again when it is next entered (see the sketch below).
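
A tiny C simulation (the initial predictor state and the number of runs are assumptions) makes the "wrong twice per loop" behavior of the 1-bit scheme concrete:

```c
#include <stdio.h>
#include <stdbool.h>

/* 1-bit predictor on the loop branch of "for (i = 0; i < 100; i++)",
   where the backward branch is taken 99 times and not taken on exit. */
int main(void) {
    bool predict_taken = true;      /* assumed initial state */
    int  mispredicts   = 0;
    int  runs          = 10;        /* the loop is reached 10 times */

    for (int r = 0; r < runs; r++) {
        for (int i = 0; i < 100; i++) {
            bool taken = (i < 99);          /* actual outcome of the loop branch */
            if (taken != predict_taken)
                mispredicts++;
            predict_taken = taken;          /* remember only the last outcome */
        }
    }
    /* Prints 19: one misprediction at the end of the first run, then two per
       run afterwards -- one on loop exit and one on re-entry. */
    printf("%d mispredictions in %d runs\n", mispredicts, runs);
    return 0;
}
```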

Improving branch prediction 2-bit saturating counter

Implementing branch prediction: use a "cache" of counters indexed by the PC. It doesn't matter that the counter at a given index might belong to a different branch (prediction is only an optimization). Increasing the size of the counters beyond 2 bits doesn't help much; increasing the number of entries stops helping after a while (4K, 8K?). Note: performance is a function of both prediction accuracy and branch frequency.
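
A minimal sketch of such a table in C, assuming a 4K-entry table of 2-bit saturating counters indexed by low-order PC bits (the table size and index function are assumptions):

```c
#include <stdint.h>
#include <stdbool.h>

#define BHT_ENTRIES 4096                      /* 4K entries of 2 bits = 8K bits */

static uint8_t bht[BHT_ENTRIES];              /* counter values 0..3 */

static unsigned bht_index(uint32_t pc) {
    return (pc >> 2) & (BHT_ENTRIES - 1);     /* drop byte offset, mask to table size */
}

/* 0,1 = predict not taken; 2,3 = predict taken */
bool bht_predict(uint32_t pc) {
    return bht[bht_index(pc)] >= 2;
}

/* Move the counter toward the actual outcome, saturating at 0 and 3. */
void bht_update(uint32_t pc, bool taken) {
    uint8_t *c = &bht[bht_index(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}
```

Two different branches that map to the same index simply share a counter; as the slide notes, that only costs some accuracy, never correctness.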

Prediction accuracy

Correlating branch predictors: if(aa==2) aa=0; if(bb==2) bb=0; if(aa==bb) { compiles to: DSUBUI R3,R1,#2; BNEZ R3,L1; DADD R1,R0,R0; L1: DSUBUI R3,R2,#2; BNEZ R3,L2; DADD R2,R0,R0; L2: DSUBU R3,R1,R2; BEQZ R3,L3. The last branch is correlated with the previous two: if both earlier branches fall through, aa and bb are both set to 0, so the last branch is taken.

Implementation: record the direction of the last m branches, generating an m-bit vector. Use that vector to index into a table of 2^m k-bit predictors. This is called a global predictor; the previous per-branch scheme is called a local predictor.

Generalization – (m,n) predictors: an m-bit global history vector selects among 2^m local predictor tables, and each table entry is an n-bit counter. It still doesn't matter whether a table entry really belongs to the current branch.
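
A C sketch of a (2,2) predictor along these lines (table sizes, the PC index function, and the point at which the history is updated are all assumptions): the m-bit global history selects one of 2^m tables, and the PC selects a 2-bit counter within it.

```c
#include <stdint.h>
#include <stdbool.h>

#define M        2                            /* bits of global branch history */
#define ENTRIES  1024                         /* counters per table (assumed) */

static uint8_t  tables[1 << M][ENTRIES];      /* 2^m tables of 2-bit counters */
static unsigned history;                      /* last m outcomes, newest in bit 0 */

static unsigned pc_index(uint32_t pc) {
    return (pc >> 2) & (ENTRIES - 1);
}

bool mn_predict(uint32_t pc) {
    return tables[history][pc_index(pc)] >= 2;        /* 2,3 = predict taken */
}

void mn_update(uint32_t pc, bool taken) {
    uint8_t *c = &tables[history][pc_index(pc)];
    if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }

    /* Shift the actual outcome into the global history register. */
    history = ((history << 1) | (taken ? 1u : 0u)) & ((1u << M) - 1u);
}
```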

2-level Branch Prediction Buffer

Impact of 2-level prediction: 8K bits of branch-prediction table is roughly equivalent to an infinite table; the same 8K bits organized as a (2,2) predictor is up to twice as effective.

Tournament Predictor: use a state counter to select which predictor to use – e.g., a 2-bit counter that switches between a local and a global branch predictor.
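
A C sketch of the selector only (table size and indexing are assumptions; the component local and global predictions are passed in rather than implemented here):

```c
#include <stdint.h>
#include <stdbool.h>

#define SEL_ENTRIES 1024

/* 2-bit selector per entry: 0,1 = trust the local predictor, 2,3 = trust the global one. */
static uint8_t selector[SEL_ENTRIES];

static unsigned sel_index(uint32_t pc) {
    return (pc >> 2) & (SEL_ENTRIES - 1);
}

bool tournament_predict(uint32_t pc, bool local_pred, bool global_pred) {
    return (selector[sel_index(pc)] >= 2) ? global_pred : local_pred;
}

/* After the branch resolves, move the selector toward whichever component
   predictor was right; no change if both were right or both were wrong. */
void tournament_update(uint32_t pc, bool local_was_right, bool global_was_right) {
    uint8_t *s = &selector[sel_index(pc)];
    if (global_was_right && !local_was_right && *s < 3) (*s)++;
    if (local_was_right  && !global_was_right && *s > 0) (*s)--;
}
```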

Speeding up instruction fetch after branches – Branch Target Buffer (BTB): a cache that stores the predicted address of the instruction that follows a branch. It functions just like a cache, i.e., unlike the branch-prediction table, the stored target has to be for the correct branch (the tag is checked). Extension: store the instruction at the target address, not just its address.
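
A direct-mapped BTB sketch in C, assuming the full branch PC is kept as the tag and a power-of-two table size; unlike the prediction counters above, the lookup only hits when the tag matches, so the returned target always belongs to the branch being fetched.

```c
#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 512                       /* assumed size, power of two */

typedef struct {
    bool     valid;
    uint32_t tag;        /* full branch PC stored as the tag, for simplicity */
    uint32_t target;     /* predicted address to fetch after this branch */
} BTBEntry;

static BTBEntry btb[BTB_ENTRIES];

static unsigned btb_index(uint32_t pc) {
    return (pc >> 2) & (BTB_ENTRIES - 1);
}

/* Fetch-stage lookup: on a hit, *next_pc is the predicted target. */
bool btb_lookup(uint32_t pc, uint32_t *next_pc) {
    const BTBEntry *e = &btb[btb_index(pc)];
    if (e->valid && e->tag == pc) { *next_pc = e->target; return true; }
    return false;        /* miss: fetch falls through to pc + 4 */
}

/* On a taken branch, install or replace the entry for this PC. */
void btb_update(uint32_t pc, uint32_t target) {
    BTBEntry *e = &btb[btb_index(pc)];
    e->valid  = true;
    e->tag    = pc;
    e->target = target;
}
```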

BTB Example