Lecture: Out-of-order Processors. Topics: branch predictor wrap-up, a basic out-of-order processor with issue queue, register renaming, and reorder buffer.


1 Lecture: Out-of-order Processors Topics: branch predictor wrap-up, a basic out-of-order processor with issue queue, register renaming, and reorder buffer

2 Amdahl’s Law
Architecture design is very bottleneck-driven – make the common case fast and do not waste resources on a component that has little impact on overall performance/power.
Amdahl’s Law: the performance improvement from an enhancement is limited by the fraction of time the enhancement comes into play.
Example: a web server spends 40% of its time in the CPU and 60% doing I/O – a processor that is ten times faster yields a 36% reduction in execution time (speedup of 1.56). Amdahl’s Law says the maximum possible reduction in execution time is 40% (maximum speedup of 1.67).
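To sanity-check the arithmetic, here is a small Python sketch of the speedup calculation (the function name and the 40% / 10x numbers are just the slide's example, not a standard API):

def speedup(fraction_enhanced, factor):
    """Amdahl's Law: overall speedup when a fraction of the original
    execution time is accelerated by the given factor."""
    new_time = (1 - fraction_enhanced) + fraction_enhanced / factor
    return 1 / new_time

print(speedup(0.4, 10))            # ~1.56: execution time drops by 36%
print(speedup(0.4, float("inf")))  # ~1.67: best case, CPU time goes to zero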

3 Principle of Locality
Most programs are predictable in terms of the instructions they execute and the data they access.
The 90/10 rule: a program spends 90% of its execution time in only 10% of the code.
Temporal locality: a program that has just accessed X will shortly re-visit X.
Spatial locality: a program that has just accessed X will shortly visit X+1.

4 Problem 1 What is the storage requirement for a global predictor that uses 3-bit saturating counters and that produces an index by XOR-ing 12 bits of branch PC with 12 bits of global history?

5 Problem 1
What is the storage requirement for a global predictor that uses 3-bit saturating counters and that produces an index by XOR-ing 12 bits of branch PC with 12 bits of global history?
The index is 12 bits wide, so the table has 2^12 saturating counters, each 3 bits wide. Total storage = 3 bits * 4096 = 12,288 bits = 12 Kb = 1.5 KB.
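The same calculation as a quick Python check (the variable names are illustrative):

index_bits, counter_bits = 12, 3
total_bits = (2 ** index_bits) * counter_bits
print(total_bits, total_bits / 1024, total_bits / 8 / 1024)  # 12288 bits, 12.0 Kb, 1.5 KB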

6 Problem 2
What is the storage requirement for a tournament predictor that uses the following structures:
 - a “selector” that has 4K entries and 2-bit counters
 - a “global” predictor that XORs 14 bits of branch PC with 14 bits of global history and uses 3-bit counters
 - a “local” predictor that uses an 8-bit index into L1 and produces a 12-bit index into L2 by XOR-ing branch PC and local history; the L2 uses 2-bit counters

7 Problem 2
What is the storage requirement for a tournament predictor that uses the following structures:
 - a “selector” that has 4K entries and 2-bit counters
 - a “global” predictor that XORs 14 bits of branch PC with 14 bits of global history and uses 3-bit counters
 - a “local” predictor that uses an 8-bit index into L1 and produces a 12-bit index into L2 by XOR-ing branch PC and local history; the L2 uses 2-bit counters
Selector = 4K * 2b = 8 Kb
Global = 3b * 2^14 = 48 Kb
Local = (12b * 2^8) + (2b * 2^12) = 3 Kb + 8 Kb = 11 Kb
Total = 67 Kb
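A quick Python check of the component-by-component total (all sizes are taken from the problem statement; the names are illustrative):

Kb = 1024
selector    = (4 * Kb) * 2     # 4K entries x 2-bit counters
global_pred = (2 ** 14) * 3    # 14-bit index x 3-bit counters
local_l1    = (2 ** 8) * 12    # 256 entries, each holding 12 bits of local history
local_l2    = (2 ** 12) * 2    # 4K entries x 2-bit counters
print((selector + global_pred + local_l1 + local_l2) / Kb)   # 67.0 Kb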

8 Problem 3
For the code snippet below, estimate the steady-state bpred accuracies for the default PC+4 prediction, the 1-bit bimodal, 2-bit bimodal, global, and local predictors. Assume that the global/local preds use 5-bit histories.
do {
  for (i=0; i<4; i++) { increment something }
  for (j=0; j<8; j++) { increment something }
  k++;
} while (k < some large number)

9 Problem 3
For the code snippet below, estimate the steady-state bpred accuracies for the default PC+4 prediction, the 1-bit bimodal, 2-bit bimodal, global, and local predictors. Assume that the global/local preds use 5-bit histories.
do {
  for (i=0; i<4; i++) { increment something }
  for (j=0; j<8; j++) { increment something }
  k++;
} while (k < some large number)
PC+4: 2/13 = 15%
1b Bim: (2+6+1)/(4+8+1) = 9/13 = 69%
2b Bim: (3+7+1)/13 = 11/13 = 85%
Global: (4+7+1)/13 = 12/13 = 92% (different branches can end up with the same 5-bit global history and interfere with each other unless you take the branch PC into account while indexing)
Local: (4+7+1)/13 = 12/13 = 92%
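The bimodal numbers can be verified with a small simulation sketch like the one below (the branch "PCs" 0/1/2 and the helper name are made up for illustration; the global and local predictors would additionally need history-based indexing, which is omitted here):

# One steady-state outer iteration: (branch_id, outcome), True = taken.
iteration = ([(0, True)] * 3 + [(0, False)] +     # i-loop branch: T T T N
             [(1, True)] * 7 + [(1, False)] +     # j-loop branch: T*7 N
             [(2, True)])                         # do-while branch: T

def bimodal_accuracy(bits, warmup=10):
    max_ctr = (1 << bits) - 1
    ctr = {}                                  # one saturating counter per branch
    correct = total = 0
    for it in range(warmup + 1):
        for pc, taken in iteration:
            c = ctr.get(pc, max_ctr)          # counters start strongly taken
            pred = c >= (max_ctr + 1) // 2    # predict taken in the upper half
            if it == warmup:                  # measure only the final iteration
                correct += (pred == taken)
                total += 1
            ctr[pc] = min(c + 1, max_ctr) if taken else max(c - 1, 0)
    return correct, total

print(bimodal_accuracy(1))   # (9, 13)  -> 69%
print(bimodal_accuracy(2))   # (11, 13) -> 85%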

10 An Out-of-Order Processor Implementation
Front end: branch prediction and instr fetch feed the Instr Fetch Queue, followed by Decode & Rename. Renamed instructions wait in the Issue Queue (IQ), execute on the ALU, and read the Register File (R1-R32). Every instruction (Instr 1 - Instr 6) is also allocated an entry (T1-T6) in the Reorder Buffer (ROB); results are written to the ROB and tags are broadcast to the IQ.
Original code:          Renamed code in the IQ:
R1 ← R1+R2              T1 ← R1+R2
R2 ← R1+R3              T2 ← T1+R3
BEQZ R2                 BEQZ T2
R3 ← R1+R2              T4 ← T1+T2
R1 ← R3+R2              T5 ← T4+T2
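The renaming step in this figure can be sketched in a few lines of Python: every instruction gets the tag of its ROB slot, destinations are renamed to that tag, and each source uses the tag of its most recent in-flight producer (or the architectural register if there is none). This is an illustration only, with made-up names, not the hardware algorithm:

code = [  # (dest, src1, src2); dest is None for the branch
    ("R1", "R1", "R2"),
    ("R2", "R1", "R3"),
    (None, "R2", None),      # BEQZ R2
    ("R3", "R1", "R2"),
    ("R1", "R3", "R2"),
]

rename_map = {}              # arch register -> ROB tag of latest in-flight producer
renamed = []
for i, (dst, s1, s2) in enumerate(code, start=1):
    tag = f"T{i}"                           # ROB slot; the branch consumes T3
    srcs = [rename_map.get(s, s) for s in (s1, s2) if s is not None]
    if dst is None:
        renamed.append(f"BEQZ {srcs[0]}")
    else:
        renamed.append(f"{tag} <- " + "+".join(srcs))
        rename_map[dst] = tag               # later readers of dst use this tag
print("\n".join(renamed))
# T1 <- R1+R2, T2 <- T1+R3, BEQZ T2, T4 <- T1+T2, T5 <- T4+T2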

11 Problem 1
Show the renamed version of the following code. Assume that you have 4 rename registers, T1-T4.
R1 ← R2+R3
R3 ← R4+R5
BEQZ R1
R1 ← R1 + R3
R1 ← R1 + R3
R3 ← R1 + R3

12 Problem 1
Show the renamed version of the following code. Assume that you have 4 rename registers, T1-T4.
R1 ← R2+R3        T1 ← R2+R3
R3 ← R4+R5        T2 ← R4+R5
BEQZ R1           BEQZ T1
R1 ← R1 + R3      T4 ← T1+T2
R1 ← R1 + R3      T1 ← T4+T2
R3 ← R1 + R3      T2 ← T1+R3
(The branch also occupies an ROB entry, T3. T1 and T2 can be re-used once instructions 1 and 2 commit; by the time the last instruction is renamed, instruction 2's value of R3 has been committed to the register file, so that source is read as R3 rather than T2.)

13 Design Details - I
Instructions enter the pipeline in order; no need for branch delay slots if prediction happens in time.
Instructions leave the pipeline in order:
 - all instructions that enter are also placed in the ROB
 - the process of an instruction leaving the ROB (in order) is called commit
 - an instruction commits only if it and all instructions before it have completed successfully (without an exception)
To preserve precise exceptions, a result is written into the register file only when the instruction commits – until then, the result is saved in a temporary register in the ROB.
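A minimal sketch of in-order commit from the ROB, under the simplifying assumption that each entry records only a destination, a value, a done flag, and an exception flag (all structure and function names here are illustrative):

from collections import deque

regfile = {f"R{i}": 0 for i in range(1, 33)}   # architected register file
rob = deque()                                  # reorder buffer, oldest entry at the left

def commit():
    """Retire completed instructions from the head of the ROB, in program order."""
    while rob and rob[0]["done"]:
        if rob[0]["exception"]:
            # Precise exception: all older instructions have already updated the
            # register file, and no younger instruction has touched it yet.
            raise RuntimeError("exception raised at the head of the ROB")
        entry = rob.popleft()
        if entry["dest"] is not None:          # branches and stores have no dest
            regfile[entry["dest"]] = entry["value"]

rob.append({"dest": "R1", "value": 7, "done": True, "exception": False})
commit()
print(regfile["R1"])   # 7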

14 Design Details - II
Instructions get renamed and placed in the issue queue – some operands are already available (T1-T6; R1-R32), while others are still being produced by instructions in flight (T1-T6).
As instructions finish, they write results into the ROB (T1-T6) and broadcast the operand tag (T1-T6) to the issue queue – instructions now know whether their operands are ready.
When a ready instruction issues, it reads its operands from T1-T6 and R1-R32 and executes (out-of-order execution).
Can you have WAW or WAR hazards? By using more names (T1-T6), name dependences are avoided.
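The tag broadcast / wakeup step can be sketched as follows (a toy model with made-up entries; real issue queues use CAM-style tag comparators rather than Python sets):

issue_queue = [
    {"op": "T2 <- T1+R3", "waiting_on": {"T1"}},
    {"op": "BEQZ T2",     "waiting_on": {"T2"}},
    {"op": "T4 <- T1+T2", "waiting_on": {"T1", "T2"}},
]

def broadcast(tag):
    """Called when a producer writes its result into the ROB."""
    for entry in issue_queue:
        entry["waiting_on"].discard(tag)       # mark this source operand as ready

def ready_to_issue():
    return [e["op"] for e in issue_queue if not e["waiting_on"]]

broadcast("T1")
print(ready_to_issue())   # ['T2 <- T1+R3'] -- only instructions with all operands ready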

15 Design Details - III
If instr-3 raises an exception, wait until it reaches the top of the ROB – at that point, R1-R32 contain the results of all instructions up to instr-3 – save the registers, save the PC of instr-3, and service the exception.
If a branch is mispredicted, flush all instructions after the branch and restart on the correct path – the mispredicted instrs will not have updated the registers (the branch cannot commit until it has completed, and the flush happens as soon as the branch completes).
Potential problems: ?

16 Managing Register Names
Logical registers: R1-R32        Physical registers: P1-P64
R1 ← R1+R2        P33 ← P1+P2
R2 ← R1+R3        P34 ← P33+P3
BEQZ R2           BEQZ P34
R3 ← R1+R2        P35 ← P33+P34
At the start, R1-R32 can be found in P1-P32. Instructions stop entering the pipeline when P64 is assigned.
What happens on commit? Note that temporary values are now stored in the register file and not in the ROB.

17 The Commit Process
On commit, no copy is required – the register map table is updated: the “committed” value of R1 is now in P33 and not P1; on an exception, P33 is copied to memory and not P1.
An instruction in the issue queue need not modify its input operand when the producer commits.
When instruction-1 commits, we no longer have any use for P1 – it is put in a free pool and a new instruction can now enter the pipeline:
 - for every instr that commits, a new instr can enter the pipeline
 - the number of in-flight instrs is a constant = the number of extra (rename) registers
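A sketch of this rename/commit interplay with a map table and a free list, assuming (purely for illustration) that rename stalls and commits the oldest in-flight instruction whenever the free list is empty; all names are made up:

NUM_ARCH, NUM_PHYS = 32, 36
map_table = {f"R{i}": f"P{i}" for i in range(1, NUM_ARCH + 1)}    # R1->P1, ..., R32->P32
free_list = [f"P{i}" for i in range(NUM_ARCH + 1, NUM_PHYS + 1)]  # P33..P36
in_flight = []    # (dest_phys, previously_mapped_phys), oldest first

def commit_oldest():
    _, prev = in_flight.pop(0)
    if prev is not None:
        free_list.append(prev)        # no one can reference the old mapping anymore

def rename(dst, srcs):
    phys_srcs = [map_table[s] for s in srcs]
    if dst is None:                   # e.g. a branch: occupies a slot, writes nothing
        in_flight.append((None, None))
        return None, phys_srcs
    while not free_list:              # stall until a physical register frees up
        commit_oldest()
    new = free_list.pop(0)
    in_flight.append((new, map_table[dst]))   # remember the mapping being overwritten
    map_table[dst] = new
    return new, phys_srcs

for dst, srcs in [("R1", ["R1", "R2"]), ("R2", ["R1", "R3"]),
                  (None, ["R2"]), ("R3", ["R1", "R2"])]:
    print(dst, rename(dst, srcs))
# R1 ('P33', ['P1', 'P2']); R2 ('P34', ['P33', 'P3']); None (None, ['P34']);
# R3 ('P35', ['P33', 'P34'])  -- matches the renaming on the previous slide

Run on the longer code sequence in Problem 2 below, the same stall-and-commit behaviour is what frees P1 and P3 for the last two instructions.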

18 The Alpha Out-of-Order Implementation
Front end: branch prediction and instr fetch feed the Instr Fetch Queue, followed by Decode & Rename. Renamed instructions wait in the Issue Queue (IQ), execute on the ALU, and read the Register File (P1-P64). Every instruction (Instr 1 - Instr 6) gets a Reorder Buffer (ROB) entry; results are written to the regfile and tags are broadcast to the IQ.
Speculative Reg Map: R1 → P36, R2 → P34     Committed Reg Map: R1 → P1, R2 → P2
Original code:          Renamed code in the IQ:
R1 ← R1+R2              P33 ← P1+P2
R2 ← R1+R3              P34 ← P33+P3
BEQZ R2                 BEQZ P34
R3 ← R1+R2              P35 ← P33+P34
R1 ← R3+R2              P36 ← P35+P34

19 Problem 2
Show the renamed version of the following code. Assume that you have 36 physical registers and 32 architected registers.
R1 ← R2+R3
R3 ← R4+R5
BEQZ R1
R1 ← R1 + R3
R1 ← R1 + R3
R3 ← R1 + R3
R4 ← R3 + R1

20 Problem 2
Show the renamed version of the following code. Assume that you have 36 physical registers and 32 architected registers.
R1 ← R2+R3        P33 ← P2+P3
R3 ← R4+R5        P34 ← P4+P5
BEQZ R1           BEQZ P33
R1 ← R1 + R3      P35 ← P33+P34
R1 ← R1 + R3      P36 ← P35+P34
R3 ← R1 + R3      P1 ← P36+P34
R4 ← R3 + R1      P3 ← P1+P36
(With only four free physical registers, rename stalls at the sixth instruction until instruction 1 commits and frees P1, and again at the seventh instruction until instruction 2 commits and frees P3.)
