Out-of-Order Execution Structures Optimizations

Presentation transcript:

Out-of-Order Execution Structures Optimizations A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Tag Elimination

Conventional Schedulers are Overdesigned
- For a MIPS-like ISA: two source tags, one destination tag
- Not all instructions use two source operands, e.g., addi $1, $2, 10
- Not all instructions produce a result that is interesting for scheduling, e.g., beq
- Some operands are ready when the instruction enters the scheduler
Source: Efficient Dynamic Scheduling Through Tag Elimination, Dan Ernst and Todd Austin, ISCA 2002

Some Operands are Ready when the Instruction Enters the Scheduler

Window Specialization
- Have reservation stations with different source operand wait capabilities

Window Specialization
- At rename, check how many source operands are not ready
- If there is an appropriate slot, proceed to schedule
- If not, stall at rename
Advantages:
- Destination bus only runs over reservation stations with comparators
- Load on the destination bus is reduced
Disadvantages:
- Stalls due to unavailability of reservation stations
- Complexity of reservation station assignment
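The rename-time check described on this slide can be sketched as follows. This is a minimal illustration, not the design from the paper: the class name `SpecializedWindow`, the slot counts, and the policy of letting an instruction fall back to a slot with more comparators than it needs are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SpecializedWindow:
    """Free reservation stations, keyed by how many tag comparators a slot has."""
    # Hypothetical slot mix: 2 slots with no comparators, 4 with one, 2 with two.
    free: dict = field(default_factory=lambda: {0: 2, 1: 4, 2: 2})

    def allocate(self, pending_sources: int):
        """Return the comparator count of the chosen slot, or None to stall at rename."""
        # Assumption: prefer the cheapest slot that can still wait for
        # every operand that is not ready yet.
        for comparators in sorted(self.free):
            if comparators >= pending_sources and self.free[comparators] > 0:
                self.free[comparators] -= 1
                return comparators
        return None  # no appropriate slot: rename stalls

    def release(self, comparators: int) -> None:
        """Return a slot to the free pool when its instruction issues."""
        self.free[comparators] += 1


window = SpecializedWindow(free={0: 1, 1: 1, 2: 0})
print(window.allocate(0))  # takes the zero-comparator slot
print(window.allocate(0))  # falls back to a one-comparator slot
print(window.allocate(2))  # None: stall, no two-comparator slot is free
```

The last call shows the disadvantage listed above: an instruction with two outstanding operands stalls even though the window is not full.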

Window Specialization - Performance
- Performance as IPC (actual clock frequency not considered)

Window Specialization - Performance
- Performance as IPC per ns

Last Tag Prediction
- Observe: an instruction becomes ready after the last tag it waits for appears
- Last tag prediction: predict which of the two tags that will be, and execute speculatively
- Correct speculation: that was the last tag
- Incorrect speculation: need to reschedule
- Detection: the instruction tries to read a value that is not available

GShare-Style Last Tag Prediction
- Two-bit saturating counters
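A gshare-style table of two-bit saturating counters could look roughly like the sketch below. The slide only names the predictor style, so the details are assumptions: the class name `LastTagPredictor`, the table size, the PC-XOR-history index, and the convention that a counter value of 2 or 3 means "the right operand arrives last".

```python
class LastTagPredictor:
    """Two-bit saturating counters; a value >= 2 predicts the right operand is last."""

    def __init__(self, index_bits: int = 10):
        self.mask = (1 << index_bits) - 1
        # Assumption: every counter starts at 1 (weakly "left operand last").
        self.counters = [1] * (1 << index_bits)

    def _index(self, pc: int, history: int) -> int:
        # gshare-style hashing: XOR the instruction address with a global
        # history register supplied by the caller (hypothetical interface).
        return ((pc >> 2) ^ history) & self.mask

    def predict(self, pc: int, history: int) -> str:
        return "right" if self.counters[self._index(pc, history)] >= 2 else "left"

    def update(self, pc: int, history: int, actual_last: str) -> None:
        """Train toward the operand that really arrived last."""
        i = self._index(pc, history)
        if actual_last == "right":
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = LastTagPredictor(index_bits=4)
print(predictor.predict(0x40, 0b1010))   # initial guess
predictor.update(0x40, 0b1010, "right")
print(predictor.predict(0x40, 0b1010))   # guess after one "right" outcome
```

The two-bit counters give hysteresis: once a counter is saturated, a single contrary outcome does not flip the prediction.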

Accuracy
- Over all instructions with two outstanding operands

Window Specialization - Performance
- Performance as IPC (actual clock frequency not considered)

Window Specialization - Performance
- Performance as IPC per ns

Prescheduling
Source: Data-flow prescheduling for large instruction windows in out-of-order processors, Pierre Michaud and André Seznec, HPCA 2001

Prescheduling
- Predict latencies
- Put scheduled instructions into a FIFO
- Slide into a smaller window
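The "predict latencies, then order by predicted issue time" idea can be illustrated with a small dataflow calculation. This is a sketch under stated assumptions rather than the mechanism from the paper: the function name `preschedule`, the tuple layout of an instruction, and the fixed per-opcode latency table are all hypothetical.

```python
from collections import defaultdict


def preschedule(instructions, latency):
    """Group instructions by predicted issue cycle.

    instructions: list of (dest, sources, opcode) tuples in program order.
    latency: dict mapping opcode to its predicted latency in cycles.
    Returns a dict: predicted issue cycle -> list of instruction indices.
    """
    ready_at = {}               # register -> cycle its value is predicted ready
    lines = defaultdict(list)   # one bucket per predicted issue cycle
    for idx, (dest, sources, opcode) in enumerate(instructions):
        # An instruction can issue once its slowest source is predicted ready.
        # Registers with no in-flight producer are assumed ready at cycle 0.
        issue = max((ready_at.get(src, 0) for src in sources), default=0)
        lines[issue].append(idx)
        if dest is not None:
            ready_at[dest] = issue + latency[opcode]
    return dict(lines)


program = [
    ("r1", ["r9"], "load"),
    ("r2", ["r1"], "add"),
    ("r3", ["r8"], "add"),
    ("r4", ["r2", "r3"], "add"),
]
print(preschedule(program, {"load": 3, "add": 1}))
```

Draining the buckets in increasing cycle order gives the FIFO order in which instructions would slide into the smaller window.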

Prescheduling Method

Prescheduling Example

Latency Prediction

Latency Prediction Contd.

Broadcast Free Scheduler

Broadcast Free Scheduler
- Cyclone design: D. Ernst, A. Hamel, T. Austin, ISCA 2003
- Preschedule instructions
- Put them into a dual-strip cyclical FIFO
- Vertical paths allow for motion between the strips

Cyclone Architecture
- Will be ready in cycle + 6

Cyclone Architecture - Cycle + 1

Cyclone Architecture - Cycle + 2

Cyclone Architecture - Cycle + 3

Cyclone Architecture - Cycle + 4

Cyclone Architecture - Cycle + 5

Cyclone Architecture - Cycle + 6

Cyclone Architecture - Mis-scheduling
- Estimate new latency

Pre-scheduler
- Can only do two cascaded MAX calculations, due to timing considerations
- Insert instruction with predicted latency N at the front of the FIFO
- Have it switch at N/2
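The "switch at N/2" rule can be made concrete with a toy position model. The strip names, the one-slot-per-cycle movement, and the handling of odd N (the vertical hop takes one cycle with no horizontal movement) are assumptions made for illustration; `cyclone_position` is a hypothetical helper, not part of the Cyclone design description.

```python
def cyclone_position(latency: int, cycle: int):
    """Return (strip, distance) for an instruction `cycle` cycles after insertion.

    distance counts slots away from the end of the queue where instructions
    issue. The instruction is inserted at distance 0 on the countdown strip,
    moves away for latency // 2 cycles, switches strips, and moves back.
    """
    half = latency // 2
    if cycle <= half:
        # Still on the countdown strip, moving away from the issue end.
        return ("countdown", cycle)
    # On the main strip, moving back; it reaches distance 0 at cycle == latency.
    return ("main", max(latency - cycle, 0))


for c in range(7):
    print(c, cyclone_position(6, c))
```

With a predicted latency of 6, the instruction is three slots out at cycle 3, switches strips, and arrives back at the issue end at cycle 6, with no tag broadcast needed to wake it.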

Cyclone IPC Performance

Cyclone True Performance and Area

Matrix Schedulers

Conventional Scheduler
[Figure labels: IW grants, WS requests]

Conventional Scheduler Timing
- Can't pipeline without introducing bubbles between dependent instructions
Source: A High-Speed Dynamic Instruction Scheduling Scheme for Superscalar Processors, Masahiro Goshima, Kengo Nishino, Yasuhiko Nakashima, Shin-ichiro Mori, Toshiaki Kitamura, Shinji Tomita, MICRO 2001

Towards a Matrix Scheduler
- Observe: in conventional scheduling, dependences are discovered twice: once at renaming and once during scheduling
- Why? Dependences are implicitly represented: producer and consumer link via a name, which is indirect
- Matrix scheduler idea: represent dependences explicitly

Dependence Matrix
[Figure: matrix with axes labeled "Who am I" and "Who do I depend upon?", with left-source and right-source entries]

Matrix Scheduler
[Figure labels: write port, wakeup]

Inserting an entry
[Figure label: write port]

Wakeup
[Figure label: wakeup]
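The insert and wakeup steps on these slides can be sketched with one bitmask per window entry. This is an illustrative model only: the class name `MatrixScheduler`, the first-free-slot allocation, and the simplification that a producer wakes its consumers at the moment it issues are assumptions.

```python
class MatrixScheduler:
    """Dependence matrix: bit j of rows[i] is set when entry i waits on entry j."""

    def __init__(self, size: int):
        self.size = size
        self.valid = [False] * size
        self.rows = [0] * size

    def insert(self, producers) -> int:
        """Write one row through the write port; returns the slot used.

        producers lists the slots of in-flight instructions this one depends
        on (at most a left and a right source in the slides' setting).
        """
        slot = self.valid.index(False)  # raises ValueError when the window is full
        self.valid[slot] = True
        self.rows[slot] = sum(1 << p for p in set(producers))
        return slot

    def ready(self):
        """Entries whose row is all zeros raise their request signal."""
        return [i for i in range(self.size) if self.valid[i] and self.rows[i] == 0]

    def issue(self, slot: int) -> None:
        """Wakeup: clear the issuing entry's column in every row, then free it."""
        clear = ~(1 << slot)
        self.rows = [row & clear for row in self.rows]
        self.valid[slot] = False


sched = MatrixScheduler(4)
a = sched.insert([])        # no outstanding producers
b = sched.insert([a])       # depends on a
c = sched.insert([a, b])    # depends on a and b
print(sched.ready())
sched.issue(a)
print(sched.ready())
```

No tag comparison happens at wakeup: the dependence was resolved to a matrix position once, at insertion, which is the explicit representation the slides argue for.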

Misspeculation Recovery
- Do not clean up
- Use external logic to inhibit request signals

Delay
[Figure: two delay plots at 0.18 um, 1.8 V, 85 C, with partial wakeup lines, each comparing (1) Matrix and (2) RAM+CAM; panel labels: "Match to ready" and "Issue to cells"]

Delay measurement points

Scheduling Priorities

Conflict Resolution
- More instructions ready than available issue slots: which get to go?
- Age vs. pseudo-random resolution
- Age is important
- Priority enforcer picks the oldest; this is complex
Source: Matrix Scheduler Reloaded, ISCA 2007

Compacting Scheduler
- Implemented in the Alpha 21264
- Physical order within scheduler corresponds to age
- Entry freed: shift up all younger entries
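Because physical order equals age in a compacting scheduler, picking the oldest ready instructions reduces to scanning from the top. The sketch below models that with a Python list; the class and method names are hypothetical, instruction names are assumed unique, and list removal stands in for the hardware shifting of younger entries.

```python
class CompactingScheduler:
    """Entries are kept in age order: index 0 is the oldest instruction."""

    def __init__(self):
        self.entries = []

    def insert(self, name: str, ready: bool = False) -> None:
        # New (youngest) instructions always enter at the bottom.
        self.entries.append({"name": name, "ready": ready})

    def wake(self, name: str) -> None:
        for entry in self.entries:
            if entry["name"] == name:
                entry["ready"] = True

    def select(self, issue_width: int):
        """Issue up to issue_width ready entries, oldest first, then compact."""
        chosen = [e["name"] for e in self.entries if e["ready"]][:issue_width]
        # Dropping the issued entries models "shift up all younger entries":
        # the survivors keep their relative (age) order.
        self.entries = [e for e in self.entries if e["name"] not in chosen]
        return chosen


iq = CompactingScheduler()
iq.insert("i0")
iq.insert("i1", ready=True)
iq.insert("i2", ready=True)
iq.insert("i3", ready=True)
print(iq.select(2))                      # the two oldest ready entries
print([e["name"] for e in iq.entries])   # survivors, still in age order
```

When more instructions are ready than there are issue slots, the younger ready entry simply waits, which is the age-based conflict resolution discussed on the previous slide.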

Goal

Virtual Physical Registers
- Physical register names are used for two purposes: scheduling and communicating
- A physical register is held much earlier than it is needed
- We need the register only after the value is produced
- De-couple scheduling names from communication names

Used vs. Allocated Registers

Virtual Physical Registers

Deadlock
- Older instruction completes later than younger ones
- No registers available
- Steal a register and re-execute
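Late allocation and the steal-and-re-execute escape from deadlock can be sketched as below. The slides do not give the policy details, so this is a toy model under assumptions: the class name `VirtualPhysicalFile`, virtual tags that increase in program order, and stealing from the youngest register holder are all hypothetical choices.

```python
class VirtualPhysicalFile:
    """Virtual tags are handed out at rename; storage is bound at completion."""

    def __init__(self, num_physical: int):
        self.free = list(range(num_physical))
        self.next_tag = 0
        self.holder = {}  # virtual tag -> physical register, set at completion

    def rename(self) -> int:
        """Give a destination a virtual tag; no physical register is reserved."""
        tag = self.next_tag  # assumption: larger tag means younger instruction
        self.next_tag += 1
        return tag

    def complete(self, tag: int):
        """Bind storage when the value is produced.

        Returns (physical_register, victim_tag). victim_tag is the younger
        instruction that lost its register and must re-execute, else None.
        (None, None) means nothing could be stolen and the caller must retry.
        """
        if self.free:
            self.holder[tag] = self.free.pop(0)
            return self.holder[tag], None
        younger = [t for t in self.holder if t > tag]
        if not younger:
            return None, None
        victim = max(younger)  # assumption: steal from the youngest holder
        self.holder[tag] = self.holder.pop(victim)
        return self.holder[tag], victim

    def release(self, tag: int) -> None:
        """Return the register to the free list, e.g. at commit of a later writer."""
        self.free.append(self.holder.pop(tag))


regs = VirtualPhysicalFile(num_physical=2)
t0, t1, t2 = regs.rename(), regs.rename(), regs.rename()
print(regs.complete(t1))  # younger instructions finish first
print(regs.complete(t2))
print(regs.complete(t0))  # oldest finishes last: steals, victim re-executes
```

Without the steal, the oldest instruction could never write its result, and since the younger ones cannot commit before it, no register would ever be freed.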

Performance vs. Physical Registers