15-740/18-740 Computer Architecture Lecture 12: Issues in OoO Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011.

Presentation transcript:


Reviews
- Due next Monday
  - Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” HPCA
  - Mutlu et al., “Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance,” IEEE Micro Top Picks
- Due next Wednesday
  - Chrysos and Emer, “Memory Dependence Prediction Using Store Sets,” ISCA

Announcements
- Milestone I
  - Deadline postponed to next Friday (Oct 14)
  - Format: 2 pages
  - Include results from your initial evaluations. We need to see good progress.
  - Of course, the results need to make sense (i.e., you should be able to explain them)
- Midterm I
  - October 21

Last Lecture
- Three pieces of dynamic information that make static scheduling difficult
- Example of out-of-order execution
  - Timeline
  - How things work in a design that uses a consolidated physical register file

Today
- Issues in OoO Execution

Review: Summary of OoO Execution Concepts
- Renaming eliminates false dependencies
- Tag broadcast enables value communication between instructions -> dataflow
- An out-of-order engine dynamically builds the dataflow graph of a piece of the program
  - Which piece? Limited to the instruction window
  - Can we do it for the whole program? Why would we like to?
  - How can we have a large instruction window efficiently?
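To make the tag-broadcast idea concrete, here is a minimal Python sketch of the wakeup step. The class names (`Operand`, `RSEntry`), the tag strings, and the `broadcast` helper are hypothetical and chosen only for illustration; real hardware does the tag comparison in parallel (CAM-style), not in a loop.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Operand:
    # Assumption: an operand either holds a value (tag is None)
    # or waits on the tag of a producer that is still in flight.
    tag: Optional[str] = None
    value: Optional[int] = None

    @property
    def ready(self) -> bool:
        return self.tag is None


@dataclass
class RSEntry:
    name: str
    dest_tag: str
    src: List[Operand]

    def is_ready(self) -> bool:
        return all(op.ready for op in self.src)


def broadcast(stations: List[RSEntry], tag: str, value: int) -> List[str]:
    """Deliver (tag, value) to every waiting operand that names this tag.

    Returns the names of entries that became ready because of this broadcast.
    """
    woken = []
    for entry in stations:
        was_ready = entry.is_ready()
        for op in entry.src:
            if op.tag == tag:
                op.tag, op.value = None, value  # capture the value, clear the tag
        if not was_ready and entry.is_ready():
            woken.append(entry.name)
    return woken
```

In this sketch a consumer never names its producer instruction directly; it only matches on the tag, which is what lets the window behave like a small dataflow graph.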

Review: An Exercise
- Assume ADD (4-cycle execute), MUL (6-cycle execute)
- Assume one adder and one multiplier
- How many cycles
  - in a non-pipelined machine
  - in an in-order-dispatch pipelined machine with future file and reorder buffer
  - in an out-of-order-dispatch pipelined machine with future file and reorder buffer

MUL R3 <- R1, R2
ADD R5 <- R3, R4
ADD R7 <- R2, R6
ADD R10 <- R8, R9
MUL R11 <- R7, R10
ADD R5 <- R5, R11

Pipeline stages: F D E R W
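As a starting point for the exercise, the following Python sketch extracts the true (read-after-write) dependencies of the six instructions and computes when each could finish executing if only dataflow limited it. It is not the answer to any of the three questions: it deliberately ignores fetch, decode, reorder, and writeback cycles, and it assumes unlimited functional units (the exercise has one adder and one multiplier). The helper names are made up for this sketch.

```python
LATENCY = {"ADD": 4, "MUL": 6}  # execute latencies given in the exercise

# (opcode, destination, sources) in program order
PROGRAM = [
    ("MUL", "R3", ("R1", "R2")),
    ("ADD", "R5", ("R3", "R4")),
    ("ADD", "R7", ("R2", "R6")),
    ("ADD", "R10", ("R8", "R9")),
    ("MUL", "R11", ("R7", "R10")),
    ("ADD", "R5", ("R5", "R11")),
]


def raw_producers(program):
    """For each instruction, list the indices of the instructions it truly depends on."""
    last_writer = {}  # architectural register -> index of most recent writer
    deps = []
    for i, (_, dst, srcs) in enumerate(program):
        deps.append(sorted({last_writer[s] for s in srcs if s in last_writer}))
        last_writer[dst] = i
    return deps


def dataflow_finish_times(program, latency):
    """Execute-only finish cycle of each instruction.

    Assumption: unlimited functional units and no front-end or retirement cost.
    """
    deps = raw_producers(program)
    finish = []
    for i, (op, _, _) in enumerate(program):
        start = max((finish[p] for p in deps[i]), default=0)
        finish.append(start + latency[op])
    return finish


if __name__ == "__main__":
    print(raw_producers(PROGRAM))
    print(dataflow_finish_times(PROGRAM, LATENCY))
```

The longest chain is 14 execute cycles under these assumptions. Any real machine in the exercise takes longer because of the pipeline stages and the single adder.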

Some Implementation Issues
- Back-to-back dispatch of dependent operations
  - When should the tag be broadcast?
- Is it a good idea to have the reservation stations also be the reorder buffer?
- Is it a good idea to have values in the register alias table?
- Is it a good idea to store values in reservation stations?
- Is it a good idea to have a single, centralized reservation station for all FUs?
- Idea: Decoupling different buffers
  - Many modern processors decouple these buffers
  - Many have distributed reservation stations across FUs

Intel Pentium Pro
[Figure courtesy of Prof. Wen-mei Hwu, University of Illinois]

Buffer Decoupling in Intel Pentium 4

Pentium Pro vs. Pentium 4
[Figure annotation: Pointers to architectural register values]

Required/Recommended Reading
- Hinton et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal
- Kessler, “The Alpha Microprocessor,” IEEE Micro, March-April
- Yeager, “The MIPS R10000 Superscalar Microprocessor,” IEEE Micro, April
- Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proc. IEEE, Dec

Buffer-Decoupled OoO Designs
- Register Alias Table contains only tags and valid bits
  - Tag is a pointer to the physical register containing the value
- During renaming, instruction allocates entries in different buffers (as needed):
  - Reorder buffer
  - Physical register file
  - Reservation stations (centralized vs. distributed)
  - Store/load buffer
- Tradeoffs?
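The renaming step in such a design can be sketched as follows. This is an illustrative Python model, not any particular processor's mechanism: the `Renamer` class, the `R`/`P` naming, and the free-list policy are assumptions, and valid bits, the reorder buffer, and reservation-station allocation are left out to keep it short.

```python
from collections import deque


class Renamer:
    """Toy register alias table (RAT) that maps architectural to physical registers."""

    def __init__(self, num_arch: int, num_phys: int):
        # Assumption: initially architectural Ri lives in physical Pi.
        self.rat = {f"R{i}": f"P{i}" for i in range(num_arch)}
        self.free = deque(f"P{i}" for i in range(num_arch, num_phys))

    def rename(self, dst: str, srcs):
        # Sources must read the current mapping before the destination is remapped,
        # otherwise "ADD R5 <- R5, R11" would read its own new register.
        phys_srcs = tuple(self.rat[s] for s in srcs)
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        old = self.rat[dst]
        new = self.free.popleft()
        self.rat[dst] = new
        # 'old' can be returned to the free list once this instruction retires.
        return new, phys_srcs, old
```

For example, two writes to R5 receive two different physical registers, so the second write no longer has an output or anti dependence on the first, while its source operand still points at the first write's result.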

Physical Register Files
- Register values stored only in physical register files (architectural register state is part of this file), e.g., Pentium 4
+ Smaller physical register file than ROB: not all instructions write to registers -> more area efficient
+ No duplication of data values in reservation stations, arch register file, reorder buffer, rename table (space efficient)
+ Only one write and read path for register data values (simpler control for register reads and writes)
-- Accessing register state always requires indirection (higher latency)
   - Instructions with ready operands need to access register file
   - Copying architectural register state takes time
-- Register file size increases. Centralization limits scalability.

Physical Register Files in Pentium 4
[Figure from Hinton et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal]

Physical Register Files in Alpha
[Figure from Kessler, “The Alpha Microprocessor,” IEEE Micro, March-April]

Alpha Pipelines
[Figure source: Microprocessor Report, 10/28/96]

Centralized Reservation Stations (Pentium Pro)
[Figure: pipeline diagram with IF, ID/Reg, one shared pool of reservation stations, functional units (Add, Br, FPU, Mul, LSU), and WB]

Distributed Reservation Stations
- Many processors: PowerPC 604, Pentium 4
[Figure: pipeline diagram with IF, ID/Reg, functional units (Add, Br, FPU, Mul, LSU), and WB]

Centralized vs. Distributed Reservation Stations
Centralized (monolithic):
+ Reservation stations not statically partitioned (can adapt to changes in instruction mix, e.g., 100% adds)
-- All entries need to have all fields even though some fields might not be needed for some instructions (e.g., branches, stores)
-- Number of ports = issue width
-- More complex control and routing logic
Distributed:
+ Each RS can be specialized to the functional unit it serves (more area efficient)
+ Number of ports can be 1 (RS per functional unit)
+ Simpler control and routing logic
-- Other RSs cannot be used if one functional unit’s RS is full (static partitioning)
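The static-partitioning cost can be illustrated with a small Python sketch that counts how many instructions can be dispatched, in order, before the first one finds no free reservation-station entry. The entry counts (8 shared versus 2 per unit) and the function name are assumptions picked for illustration; no entries are freed, so this models a single burst only.

```python
def dispatch_until_stall(ops, pools, pool_of):
    """Count instructions dispatched before the first structural stall.

    ops     : opcode classes in program order
    pools   : dict mapping pool name -> number of RS entries
    pool_of : function mapping an opcode class to the pool it must use
    """
    free = dict(pools)  # copy so the caller's configuration is not modified
    count = 0
    for op in ops:
        pool = pool_of(op)
        if free[pool] == 0:
            break  # in-order dispatch stalls at the first op with no free entry
        free[pool] -= 1
        count += 1
    return count


# Same total capacity (8 entries), organized two ways. Sizes are assumptions.
CENTRALIZED = {"shared": 8}
DISTRIBUTED = {"ADD": 2, "MUL": 2, "BR": 2, "LSU": 2}

if __name__ == "__main__":
    all_adds = ["ADD"] * 8
    print(dispatch_until_stall(all_adds, CENTRALIZED, lambda op: "shared"))
    print(dispatch_until_stall(all_adds, DISTRIBUTED, lambda op: op))
```

With a 100% add mix, the shared pool accepts all eight instructions while the partitioned one stalls after two. With a balanced mix, both organizations accept eight.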

Issues in Scheduling Logic (I)
- What if multiple instructions become ready in the same cycle?
  - Why does this happen?
    - An instruction with multiple dependents
    - Multiple instructions can complete in the same cycle
- Which one to schedule?
  - Oldest
  - Random
  - Most dependents first
  - Most critical first (longest-latency dependency chain)
- Does not matter for performance unless the active window is very large
  - Butler and Patt, “An Investigation of the Performance of Various Dynamic Scheduling Techniques,” MICRO
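The four selection policies can be written down as a single select function over the set of ready instructions. This Python sketch is illustrative only: the dictionary fields (`age`, `dependents`, `chain`) are hypothetical stand-ins for information a real scheduler would have to track or approximate in hardware.

```python
import random


def select(ready, policy, rng=None):
    """Pick one instruction from the ready list according to the named policy.

    Each entry is a dict with assumed fields:
      age        - smaller means older in program order
      dependents - number of instructions waiting on this one
      chain      - latency of the longest dependency chain starting here
    """
    if not ready:
        return None
    if policy == "oldest":
        return min(ready, key=lambda i: i["age"])
    if policy == "random":
        return (rng or random).choice(ready)
    if policy == "most_dependents":
        return max(ready, key=lambda i: i["dependents"])
    if policy == "most_critical":
        return max(ready, key=lambda i: i["chain"])
    raise ValueError(f"unknown policy: {policy}")
```

Oldest-first only needs program order, which the window already tracks. The last two policies need extra bookkeeping per instruction.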

Issues in Scheduling Logic (II)
- When to schedule the dependents of a multi-cycle execute instruction?
  - One cycle before the completion of the instruction
  - Example: IMUL, Pentium 4, 3-cycle ADD
- When to schedule the dependents of a variable-latency execute instruction?
  - A load can hit or miss in the data cache
  - Option 1: Schedule dependents assuming load will hit
  - Option 2: Schedule dependents assuming load will miss
- When do we take out an instruction from a reservation station?
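The timing tradeoff can be sketched with a toy Python model. Everything numeric here is an assumption made for illustration: a one-cycle gap between scheduling and execution, a fixed replay penalty when a hit-speculated dependent has to be rescheduled, and the reading of Option 2 as "wake dependents only once the load data has actually arrived". Real designs differ in all three.

```python
def wakeup_cycle(issue_cycle, latency, sched_to_exec=1):
    """Cycle in which to broadcast the producer's tag.

    Broadcasting sched_to_exec cycles before completion lets the dependent
    begin executing in the cycle the result becomes available.
    """
    return issue_cycle + latency - sched_to_exec


def dependent_start(issue_cycle, hit_lat, miss_lat, load_hits, assume_hit,
                    sched_to_exec=1, replay_penalty=2):
    """Cycle in which a load's dependent starts executing with correct data (toy model)."""
    data_ready = issue_cycle + (hit_lat if load_hits else miss_lat)
    if assume_hit:
        if load_hits:
            return data_ready  # woken early, runs back to back with the load
        return data_ready + replay_penalty  # was mis-scheduled and must be replayed
    # Conservative option: wake up only after the data is known to be there.
    return data_ready + sched_to_exec


if __name__ == "__main__":
    print(wakeup_cycle(10, 3))
    for hits in (True, False):
        for assume in (True, False):
            print(hits, assume, dependent_start(0, 2, 20, hits, assume))
```

In this model, speculating on a hit wins when the load hits and loses only the replay penalty when it misses, which is small next to the miss latency itself. The conservative option pays the scheduling delay on every load.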