Defining Wakeup Width for Efficient Dynamic Scheduling
A. Aggarwal, O. Ergin – Binghamton University
M. Franklin – University of Maryland
Presented by: Deniz Balkan

Dynamic Scheduler
Workings of a dynamic scheduler:
– Wake up dependent instructions
– Select instructions from a pool of ready instructions
Both of these operations form a critical path
An increase of a single cycle in this critical path impacts performance
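To make the two-step loop concrete, here is a minimal cycle-level sketch in Python (a hypothetical model with assumed names such as Instr and ISSUE_WIDTH, not the paper's simulator): broadcast tags wake dependents, select picks up to issue-width ready instructions, and the tags of the newly issued instructions feed the next cycle's wakeup.

```python
# Hypothetical cycle-level model of the wakeup/select loop (not the paper's simulator).
from dataclasses import dataclass, field
from typing import Optional

ISSUE_WIDTH = 4  # assumed machine width for this sketch

@dataclass
class Instr:
    dest: Optional[int]                         # destination register tag, if any
    sources: set = field(default_factory=set)   # source tags still outstanding
    ready: bool = False

def wakeup(queue, broadcast_tags):
    # Wakeup: compare broadcast tags (a set) against every waiting source operand.
    for instr in queue:
        instr.sources -= broadcast_tags
        if not instr.sources:
            instr.ready = True

def select(queue):
    # Select: pick up to ISSUE_WIDTH ready instructions (oldest first).
    issued = [i for i in queue if i.ready][:ISSUE_WIDTH]
    for i in issued:
        queue.remove(i)
    return issued

def scheduler_cycle(queue, broadcast_tags):
    # Wakeup and select form the critical loop: the tags of instructions issued
    # this cycle are what wake their dependents in the next cycle.
    wakeup(queue, broadcast_tags)
    issued = select(queue)
    return {i.dest for i in issued if i.dest is not None}

# Usage: tag 3 arrives, the instruction producing tag 7 wakes up and issues.
q = [Instr(dest=7, sources={3}), Instr(dest=None, sources={7})]
print(scheduler_cycle(q, broadcast_tags={3}))   # -> {7}
```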

Implications of a large Dynamic Scheduler
A large dynamic scheduler has the potential to exploit more ILP:
– Larger issue queue
– Larger issue width
Implications:
– Longer wire delays associated with driving register tags
– Longer wire delays in driving tag comparison results
– Longer select logic latency
Overall, increased scheduler latency, resulting in slower clock speed

Contributions of this paper
Wakeup width definition – the effective number of results used for instruction wakeup
– Usually equal to the issue width
Reduced wakeup width dynamic scheduler
– Issue width remains the same
– Reduces instruction wakeup latency, energy consumption, and area
– Less than 2% reduction in IPC

Program Behavior Study
Not all instructions produce a result
– Branch and store instructions form about 30%
The entire issue width of the processor is not used in every cycle
The average number of tags generated per cycle is considerably less than the processor issue width

Tags generated in a cycle
To generate more tags per cycle, a fetch, issue, and commit width of 12 was used
Almost 50% of cycles have either 0 or 1 tag generated, even with this large issue width
About 80% of cycles have 3 or fewer tags generated per cycle
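A small sketch of how such a tags-per-cycle distribution could be tallied from a simulation trace (hypothetical helper, illustrative data only; the paper's numbers come from its SimpleScalar-based simulator):

```python
from collections import Counter

def tag_histogram(tags_per_cycle):
    # tags_per_cycle: one integer per simulated cycle, the number of
    # destination tags produced in that cycle.
    hist = Counter(tags_per_cycle)
    total = sum(hist.values())
    cumulative = 0.0
    for n in sorted(hist):
        cumulative += hist[n] / total
        print(f"{n} tags: {hist[n] / total:5.1%} of cycles (cumulative {cumulative:5.1%})")

# Toy trace only; real distributions come from the cycle-accurate simulator.
tag_histogram([0, 1, 0, 3, 2, 0, 1, 4, 0, 1])
```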

Useful tags
Not all generated tags are immediately useful
– Branch mispredictions lead to tags generated along the wrong path, which are not immediately required
– Dependent instructions may not yet be present in the issue queue, or may still be waiting for other operands
The average number of useful tags per cycle is even lower than the average number of tags generated per cycle

Useful tags (cont.)
Only about 50–60% of instructions produce a tag that is immediately required

Reduced Wakeup Width Dynamic Scheduler
Wakeup width is reduced while the issue width is kept intact
– Some tags may have to wait before waking up their dependent instructions
Performance impact is not expected to be high
– Cycles with fewer tags soon follow
– Waiting tags can use the wakeup slots available in those cycles
– Delaying tags that are not immediately useful may have no performance impact
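A behavioral sketch of the idea, under the simplifying assumption that waiting tags sit in a single shared buffer (in the actual design, described on the following slides, a waiting tag stays in a per-FU tag latch):

```python
from collections import deque

WAKEUP_WIDTH = 3        # fewer broadcast ports than the issue width (e.g. 6)
pending_tags = deque()  # tags still waiting for a wakeup slot

def broadcast_this_cycle(new_tags):
    """Return the tags actually driven on the wakeup ports this cycle.
    Tags that do not fit wait and are driven in a later cycle."""
    pending_tags.extend(new_tags)
    driven = [pending_tags.popleft()
              for _ in range(min(WAKEUP_WIDTH, len(pending_tags)))]
    return driven

# Example: 5 tags produced in one cycle, only 3 wakeup ports available.
print(broadcast_this_cycle([10, 11, 12, 13, 14]))  # -> [10, 11, 12]
print(broadcast_this_cycle([]))                    # -> [13, 14] (drained in a later cycle)
```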

Hardware Implementation – Conventional DS
Select logic decides which instruction executes on which FU
Register tags of issued instructions are placed in tag latches
Enable signals turn on the drivers that drive the tags across the instruction window
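A toy behavioral model of this conventional datapath (assumed class and method names; one tag latch and one dedicated set of tag lines per FU, with every latched tag driven each cycle):

```python
class ConventionalTagDrive:
    """One tag latch and one dedicated set of tag lines per FU."""
    def __init__(self, issue_width):
        self.tag_latches = [None] * issue_width

    def issue(self, fu, tag):
        # Select logic chose this FU; latch the issued instruction's destination tag.
        self.tag_latches[fu] = tag

    def drive(self):
        # Enable the drivers of all occupied latches and drive their tags
        # across the instruction window; then the latches are free again.
        driven = [t for t in self.tag_latches if t is not None]
        self.tag_latches = [None] * len(self.tag_latches)
        return driven

ds = ConventionalTagDrive(issue_width=6)
ds.issue(0, 42); ds.issue(3, 17)
print(ds.drive())   # [42, 17] -- every latched tag is driven this cycle
```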

Hardware Implementation – RWW DS
Wakeup width is reduced to half the issue width
Two tag latches (one per FU in a pair) share a common set of tag lines
If both tag latches hold tags, only one of them is driven; the other tag remains in its latch
To prevent overwriting the waiting tag, a 1-bit indicator latch controls the selection process
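A toy behavioral model of the shared tag-line pair (assumed names; the exact encoding of the indicator bit is an assumption, here taken as "which latch drives next"):

```python
class SharedTagLinePair:
    """Two FUs share one set of tag lines; at most one tag is driven per cycle."""
    def __init__(self):
        self.latch = [None, None]   # tag latch of FU0 and FU1
        self.indicator = 0          # 1-bit latch: which tag latch drives next

    def issue(self, fu, tag):
        # Latch the destination tag of a tag-producing instruction issued to this FU.
        # (The FU arbiter is expected to keep a waiting tag from being overwritten;
        # see the arbiter slide.)
        self.latch[fu] = tag

    def drive(self):
        # Drive one of the (up to two) latched tags on the shared tag lines;
        # the other tag stays in its latch and waits for a later cycle.
        if self.latch[0] is not None and self.latch[1] is not None:
            first = self.indicator
            driven, self.latch[first] = self.latch[first], None
            self.indicator = 1 - first   # the waiting latch drives next
            return driven
        for i in (0, 1):
            if self.latch[i] is not None:
                driven, self.latch[i] = self.latch[i], None
                return driven
        return None

pair = SharedTagLinePair()
pair.issue(0, 42); pair.issue(1, 17)   # both FUs produced a tag this cycle
print(pair.drive())   # 42 drives; 17 waits in its latch
print(pair.drive())   # 17 drives the following cycle
```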

FU arbiter
Decides which instruction is executed on the FU
Conventional arbiter, giving priority to the oldest instruction (req0):
Grant1 = (NOT req0) AND req1 AND enable
Arbiter with the RWW dynamic scheduler, where "a" is the value of the indicator latch for the arbiter:
Grant1 = (NOT req0) AND a AND req1 AND enable
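The grant equations written out as a small boolean sketch (hypothetical Python helpers; req0 is the older request, "a" is the indicator-latch value as on the slide, and the Grant0 terms are the standard priority-arbiter form assumed here):

```python
def conventional_grants(req0, req1, enable):
    """Priority arbiter: the older request (req0) wins; req1 is granted
    only when req0 is not requesting."""
    grant0 = req0 and enable                     # assumed standard form
    grant1 = (not req0) and req1 and enable      # Grant1 from the slide
    return grant0, grant1

def rww_grants(req0, req1, enable, a):
    """Same arbiter, with Grant1 additionally gated by the indicator-latch value a."""
    grant0 = req0 and enable                     # assumed standard form
    grant1 = (not req0) and a and req1 and enable
    return grant0, grant1

# Example: with req0 inactive, req1 is granted only if enable (and, for RWW, a) hold.
print(conventional_grants(False, True, True))     # (False, True)
print(rww_grants(False, True, True, a=False))     # (False, False)
```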

Experimental Setup
Simulator based on SimpleScalar to collect the performance statistics
Delay, energy, and area estimated from actual VLSI layouts using SPICE, in a 0.18-micron, 6-metal-layer TSMC CMOS process
Dynamic scheduler size: 128-entry issue queue, 6-way issue width

Performance Results
Compared to the I6W6 (issue width 6, wakeup width 6) configuration:
– I6W3 has 15% lower wakeup logic latency
IPC impact is about 5% for I6W3
– Higher for high-IPC FP benchmarks
– I6W3 is significantly better than I3W3, which has the same wakeup logic latency as I6W3

IPC of FP benchmarks with RWW
Reasons for the IPC impact:
– Instructions delayed due to waiting tags
– Issue slots wasted because of waiting tags

Reasons for the IPC impact
Delayed register tags have more impact than issue-slot wastage
As the wakeup width is reduced, the impact of delayed register tags increases dramatically

Area and Energy Results
Activation statistics obtained through simulation; energy consumption values taken from the detailed layouts
– I6W3 reduces wakeup logic energy consumption by 10%
The area of the CAM cells (the tag part of the instruction window) is reduced by about 30% for I6W3

Reduced Issue Slot Wastage (RWIS)
Issue slots are wasted when no instruction is issued to an FU that already has a waiting tag
Instructions are classified into:
– Tag-producing instructions
– Non-tag-producing instructions
Non-tag-producing instructions can still be issued to FUs with waiting tags without overwriting the tag value
A type bit included with each instruction controls this issue decision
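A sketch of the RWIS issue check (hypothetical helper names): the type bit marks tag-producing instructions, and an FU whose tag latch still holds a waiting tag accepts only non-tag-producing ones:

```python
def can_issue_to_fu(instr_produces_tag, fu_has_waiting_tag):
    """RWIS rule: an FU with a waiting tag in its latch may still accept a
    non-tag-producing instruction (e.g. branch or store), since nothing is
    overwritten; a tag-producing instruction needs an FU with a free tag latch."""
    if not fu_has_waiting_tag:
        return True
    return not instr_produces_tag   # type bit = non-tag-producing is still OK

# Example: a store can use an FU whose latch still holds a waiting tag.
print(can_issue_to_fu(instr_produces_tag=False, fu_has_waiting_tag=True))  # True
print(can_issue_to_fu(instr_produces_tag=True,  fu_has_waiting_tag=True))  # False
```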

Reduced Tag Delays (RTD)
Register tags are delayed when multiple tag-producing instructions are issued to the FUs sharing the same tag lines (an FU-group)
RTD limits the number of tag-producing instructions issued to an FU-group
– The waiting tags of the previous cycle are used for this purpose
Non-tag-producing instructions can still be issued to FUs with their indicator bits set
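A sketch of the RTD constraint (hypothetical helper; the exact accounting is an assumption, here "new tag producers allowed = wakeup slots of the group + RTD limit − waiting tags left from the previous cycle"):

```python
def tag_producers_allowed(waiting_tags_prev_cycle, wakeup_slots_per_group=1, rtd_limit=1):
    """RTD-k sketch: keep the number of waiting tags in an FU-group at or below k.
    The group drives `wakeup_slots_per_group` tags per cycle, so the number of
    new tag-producing instructions accepted is whatever keeps the waiting-tag
    count (carried over from the previous cycle) within the limit."""
    allowed = wakeup_slots_per_group + rtd_limit - waiting_tags_prev_cycle
    return max(0, allowed)

# Example (RTD-1, one shared set of tag lines per FU pair):
print(tag_producers_allowed(waiting_tags_prev_cycle=0))  # 2 tag producers accepted
print(tag_producers_allowed(waiting_tags_prev_cycle=1))  # only 1 this cycle
```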

Enhanced Performance
RTD-1 (with a maximum of 1 waiting tag) is the most effective
RWIS reduces the wastage of issue slots; RTD additionally reduces the number of waiting register tags
RTD-2 results in more instructions getting delayed by waiting register tags than RTD-1

Conclusions
Larger dynamic schedulers can exploit more ILP, increasing performance
A larger dynamic scheduler, however, results in longer scheduler latency
The reduced wakeup width (RWW) dynamic scheduler exploits the property that the number of useful tags generated per cycle is significantly less than the issue width
RWW gives a significant reduction in wakeup logic latency, dynamic scheduler area, and energy consumption with minimal IPC impact