Scheduling Reusable Instructions for Power Reduction
J.S. Hu, N. Vijaykrishnan, S. Kim, M. Kandemir, and M.J. Irwin
Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), Volume 1, Feb. 2004

Abstract
 In this paper, we propose a new issue queue design that is capable of scheduling reusable instructions. Once the issue queue is reusing instructions, no instruction cache access is needed, since the instructions are supplied by the issue queue itself. Furthermore, dynamic branch prediction and instruction decoding can also be avoided, permitting the gating of the front-end stages of the pipeline (the stages before register renaming). Results using array-intensive codes show that the pipeline front-end can be gated for up to 82% of the total execution cycles, providing a power reduction of 72% in the instruction cache, 33% in the branch predictor, and 21% in the issue queue, at a small performance cost. Our analysis of compiler optimizations indicates that the power savings can be further improved by using optimized code.

Outline
 What's the problem
 Introduction
 Implementation foundation and analysis
 Proposed method and architecture
 Experimental results and evaluation
 Conclusions

What's the Problem
 Power is not only a design limiter but also a major constraint in embedded systems design
 The front-end of the pipeline (the stages before register renaming) is a power-hungry component of an embedded microprocessor
   In the StrongARM, instruction cache accesses account for about 27% of the power dissipation
 Furthermore, sophisticated branch predictors are also very power-consuming
 Therefore, optimizing the power consumption of the pipeline front-end is one of the most challenging issues in embedded processor design

Introduction
 In recent years, several techniques have been proposed to reduce the power consumption of the pipeline front-end:
 Stage-skip pipeline: utilizes a decoded instruction buffer (DIB) to temporarily store decoded loop instructions for later reuse
   Disadvantages: 1) requires ISA modification; 2) needs an additional instruction buffer; 3) buffers only one iteration of the loop (costing performance)
 Loop caches: dynamic/preloaded loop caches
   Disadvantages: 1) need an additional loop cache; 2) buffer only one iteration of the loop (costing performance)

Introduction (cont.)
 Filter cache: uses a small level-zero cache to capture tight spatial/temporal locality in cache accesses
 The proposed approach: a new issue queue design based on a superscalar architecture
   Schedules reusable loop instructions within the issue queue
   No additional instruction buffer is needed: the existing issue queue resources are utilized
   Automatically unrolls loops in the issue queue
   No ISA modification is required
   The front-end of the pipeline can be gated
   Addresses the power problem in the front-end of the pipeline

The Baseline Datapath: Based on a MIPS Core
 (a) The baseline datapath model of the MIPS R10000
 (b) The pipeline stages of the baseline superscalar microprocessor

Implementation Analysis
 Reusable instructions are mainly those belonging to loop structures, which are executed repeatedly
 The new issue queue design consists of the following four parts:
   A loop structure detector
   A mechanism to buffer the reusable instructions within the issue queue
   A scheduling mechanism to reuse the buffered instructions in their program order
   A recovery scheme from the reuse state back to the normal state

Issue Queue State Transition
 Unsuccessful buffering is revoked
 A change of control flow inside a loop being buffered causes the buffering to be revoked
 The front-end of the pipeline is gated during the Code_Reuse state
 A misprediction due to normally exiting a loop restores the issue queue to the Normal state
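
The three states named above suggest a simple controller. Below is a minimal sketch of our own reading of the transitions described on this slide; the event names are illustrative, not taken from the paper:

    /* Issue-queue controller states and events (names are illustrative). */
    typedef enum { NORMAL, LOOP_BUFFERING, CODE_REUSE } IQState;

    typedef enum {
        LOOP_DETECTED,        /* capturable loop recognized at decode */
        BUFFERING_DONE,       /* loop fully buffered in the issue queue */
        BUFFERING_REVOKED,    /* e.g., control flow changed while buffering */
        BRANCH_MISPREDICTED   /* includes the normal loop exit during reuse */
    } IQEvent;

    IQState next_state(IQState s, IQEvent e) {
        switch (s) {
        case NORMAL:
            return (e == LOOP_DETECTED) ? LOOP_BUFFERING : NORMAL;
        case LOOP_BUFFERING:
            if (e == BUFFERING_DONE) return CODE_REUSE;  /* gate the front-end */
            if (e == BUFFERING_REVOKED || e == BRANCH_MISPREDICTED) return NORMAL;
            return LOOP_BUFFERING;
        case CODE_REUSE:      /* front-end stays gated while in this state */
            return (e == BRANCH_MISPREDICTED) ? NORMAL : CODE_REUSE;
        }
        return NORMAL;
    }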

Detecting Reusable Loop Structures
 Conditional branch and direct jump instructions may form the last instruction of a loop iteration
 Logic is added to check that: (a) the branch/jump is backward; (b) the loop size is no larger than the issue queue size
 This check is performed at the decode stage using the predicted target address (a sketch of the check follows below)
 Once a capturable loop is detected, its instructions are buffered starting from the second iteration
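
A minimal sketch of this capturability check, assuming a fixed-length, MIPS-like ISA; the function and parameter names are ours:

    #include <stdbool.h>
    #include <stdint.h>

    #define INSN_BYTES 4  /* fixed-length ISA assumed (MIPS-like) */

    /* Performed at decode, using the predicted target address. */
    bool is_capturable_loop(uint32_t branch_pc, uint32_t predicted_target,
                            unsigned issue_queue_entries) {
        if (predicted_target >= branch_pc)         /* (a) must be backward */
            return false;
        uint32_t body = (branch_pc - predicted_target) / INSN_BYTES + 1;
        return body <= issue_queue_entries;        /* (b) must fit the queue */
    }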

Buffering Reusable Instructions
 The issue queue micro-architecture is extended
 Buffering a reusable instruction requires several operations:
   The Classification Bit is set; with this bit set, the instruction is not removed from the issue queue even after it has been issued
   The Issue State Bit is reset to zero
   The logical register numbers are recorded in the Logic Register List (LRL)
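
The extended entry and the buffering step might look like the sketch below. The paper specifies the two bits and the LRL, but the exact layout and names here are our assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_REGS 3  /* e.g., two sources plus one destination (assumed) */

    typedef struct {
        bool    classification;   /* Classification Bit: keep entry after issue */
        bool    issued;           /* Issue State Bit */
        uint8_t lrl[MAX_REGS];    /* Logic Register List: logical reg numbers */
        /* ... opcode, immediates, renamed physical tags, etc. ... */
    } IQEntry;

    /* Mark an entry as reusable while the loop's second iteration dispatches. */
    void buffer_reusable(IQEntry *e, const uint8_t *logical_regs, int nregs) {
        e->classification = true;         /* entry survives being issued */
        e->issued = false;                /* Issue State Bit reset to zero */
        for (int i = 0; i < nregs; i++)   /* record pre-rename register numbers */
            e->lrl[i] = logical_regs[i];
    }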

Strategy of When to Stop Buffering
 Buffering only one iteration of the loop
   Advantage: enters the Code_Reuse state, and thus gates the pipeline front-end, much earlier (gaining more power reduction)
 Buffering multiple iterations of the loop
   Advantages: automatically unrolls the loop to exploit more ILP; the issue queue resources are used more effectively
 Although the second strategy does not gate the pipeline front-end as quickly as the first, we choose it for the sake of performance

Optimizing the Loop Buffering Strategy
 During the Loop_Buffering state, the current loop is identified as non-bufferable if:
   An inner loop is detected
   The execution exits the current loop
   A procedure call within the loop causes the issue queue to fill up before the loop end is reached
 (Example of a non-bufferable loop)
 A Non-Bufferable Loop Table (NBLT) stores non-bufferable loops; if a detected loop appears in the NBLT, no buffering is attempted for it
 With this optimization, the issue queue can avoid most attempts to buffer non-bufferable loops
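
The paper does not spell out an NBLT organization; a small direct-mapped table indexed by the PC of the loop-ending branch is one plausible sketch:

    #include <stdbool.h>
    #include <stdint.h>

    #define NBLT_ENTRIES 16

    static uint32_t nblt[NBLT_ENTRIES];  /* branch PC of the loop; 0 = empty */

    static unsigned nblt_index(uint32_t pc) { return (pc >> 2) % NBLT_ENTRIES; }

    /* A hit means a previous buffering attempt failed: skip buffering. */
    bool nblt_contains(uint32_t branch_pc) {
        return nblt[nblt_index(branch_pc)] == branch_pc;
    }

    /* Record a loop that turned out to be non-bufferable. */
    void nblt_insert(uint32_t branch_pc) {
        nblt[nblt_index(branch_pc)] = branch_pc;
    }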

Reusing Buffered Instructions
 During the Code_Reuse state, instruction cache accesses and instruction decoding are disabled
 A mechanism is needed to reuse the buffered instructions already in the issue queue
 Thus, the instructions are supplied by the issue queue itself

Reusing Buffered Instructions (cont.)
 A reuse pointer scans for the instructions to be reused
 The issue state bits of the first n instructions starting from the reuse pointer are checked; if they are set, the logical register numbers are sent for register renaming to reuse those instructions, and the reuse pointer is advanced by n to scan instructions for the next cycle
 Renamed instructions update their corresponding entries (e.g., register information) in the issue queue
 In short: scan and send for register renaming; after renaming, update the register information in the issue queue
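
A per-cycle reuse scan could look like the sketch below; it advances the pointer one instruction at a time up to the rename width n (equivalent to advancing by n), with the entry reduced to the relevant fields and renaming stubbed out:

    #include <stdbool.h>
    #include <stdint.h>

    #define IQ_SIZE 32

    typedef struct { bool issued; uint8_t lrl[3]; } IQEntry;  /* reduced view */

    static IQEntry iq[IQ_SIZE];
    static int reuse_ptr, first_buffered, last_buffered;

    static void send_to_rename(const uint8_t *lrl) { (void)lrl; }  /* stub */

    /* Each cycle, re-inject up to n buffered instructions into renaming. */
    void reuse_scan(int n) {
        for (int i = 0; i < n; i++) {
            IQEntry *e = &iq[reuse_ptr];
            if (!e->issued)            /* previous instance not issued yet */
                break;
            send_to_rename(e->lrl);    /* logical regs are renamed again; the
                                          entry is then updated with new tags */
            e->issued = false;         /* entry now represents a fresh instance */
            reuse_ptr = (reuse_ptr == last_buffered)
                      ? first_buffered /* wrap back to the loop top */
                      : reuse_ptr + 1;
        }
    }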

Reusing Buffered Instructions (cont.)
 The reuse pointer is automatically reset to the position of the first buffered instruction after the last buffered instruction is reused
 This process repeats until a branch misprediction is detected; a misprediction due to normally exiting a loop restores the issue queue to the Normal state
 During the Code_Reuse state, dynamic branch prediction is disabled
   Branch instructions are predicted statically; static prediction works well since branches within loops are normally highly biased in one direction
   The static prediction is still verified after the branch completes
   If the static prediction is found to be incorrect (a misprediction) during this verification, the issue queue exits the Code_Reuse state and restores the Normal state
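
A minimal sketch of this static prediction and its verification, assuming backward branches are statically predicted taken (i.e., the loop continues); the names are ours:

    #include <stdbool.h>
    #include <stdint.h>

    /* Backward branch: statically predict taken, i.e. another iteration. */
    bool static_predict_taken(uint32_t branch_pc, uint32_t target) {
        return target < branch_pc;
    }

    /* Checked when the branch completes. Returning false means the static
       prediction was wrong (e.g., the loop exited normally), so the issue
       queue must leave Code_Reuse and restore the Normal state. */
    bool verify_static_prediction(bool predicted_taken, bool actual_taken) {
        return predicted_taken == actual_taken;
    }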

Restoring the Normal State
 Recovery process for revoking the current buffering state:
   Any instruction whose classification bit and issue state bit are both set is immediately removed from the issue queue
   The classification bits of the remaining instructions are then cleared
 Recovery process due to a misprediction in the Loop_Buffering state:
   In addition to the steps above, instructions newer than the branch are removed from the issue queue and the ROB
 Recovery process due to a misprediction in the Code_Reuse state:
   All of the above steps are performed, and in addition the gating signal is reset
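
The three recovery cases share the same core loop over the issue queue; the sketch below is our consolidation of them, with the squash and ungate steps stubbed out:

    #include <stdbool.h>

    #define IQ_SIZE 32
    typedef struct { bool classification, issued; } IQEntry;  /* reduced view */
    static IQEntry iq[IQ_SIZE];

    static void remove_from_iq(int i)            { iq[i] = (IQEntry){0}; }
    static void squash_younger_than_branch(void) { /* flush IQ + ROB (stub) */ }
    static void ungate_front_end(void)           { /* reset gating signal */ }

    void restore_normal_state(bool after_mispredict, bool was_code_reuse) {
        for (int i = 0; i < IQ_SIZE; i++) {
            if (iq[i].classification && iq[i].issued)
                remove_from_iq(i);             /* issued reusable copies die */
            else
                iq[i].classification = false;  /* remaining entries -> normal */
        }
        if (after_mispredict)
            squash_younger_than_branch();      /* drop insns newer than branch */
        if (was_code_reuse)
            ungate_front_end();                /* gating signal is reset */
    }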

Results - Rate of Gated Front-End
 On average, the pipeline front-end gated rate increases from 42% to 82% as the issue queue size increases
 However, increasing the issue queue size doesn't always improve the gated rate (e.g., tsf and wss)
 A larger issue queue buffers more iterations and delays the pipeline gating

Results - Power Savings in the Front-End
 On average, as the issue queue size increases, the power reduction is:
   35% - 72% in the ICache
   19% - 33% in the branch predictor
   12% - 21% in the issue queue (due to partial updates)
 with < 2% overhead (due to the supporting logic)

Results - Overall Power Reduction
 On average, the overall power reduction improves from 8% to 12% as the issue queue size increases
 But for some configurations the overall power increases (e.g., large loop structures with a small issue queue size)
   The front-end is not gated, yet the supporting logic still consumes power

Results - Performance Loss
 The average performance loss ranges from 0.2% to 4% as the issue queue size increases
 This is due to the issue queue not being fully utilized, since we buffer an integer number of loop iterations

Results - Impact of Compiler Optimizations
 Large loop structures can hardly be captured with a small issue queue
 Loop distribution is performed to reduce the size of the loop body:
   Break a loop into two or more smaller loops
   Gear the loop code towards a given issue queue size (see the example below)
 Optimized code increases the power savings from 8% to 13% with an issue queue size of 64 entries
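
As an illustration (our own example, not from the paper), loop distribution splits one large loop body into smaller loops, each of which can be captured by the issue queue:

    /* Before: one large loop body that may exceed the issue queue size. */
    void before(float *a, float *b, const float *c, int n) {
        for (int i = 0; i < n; i++) {
            a[i] = a[i] * 2.0f;
            b[i] = b[i] + c[i];
        }
    }

    /* After distribution: two smaller loops, each capturable on its own. */
    void after(float *a, float *b, const float *c, int n) {
        for (int i = 0; i < n; i++)
            a[i] = a[i] * 2.0f;
        for (int i = 0; i < n; i++)
            b[i] = b[i] + c[i];
    }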

Conclusions
 Proposed a new issue queue architecture that can:
   Detect capturable loop code
   Buffer loop code in the issue queue
   Reuse the buffered loop instructions
 Significant power reduction in the pipeline front-end components (e.g., the ICache, branch predictor, and instruction decoder) while they are gated
 Compiler optimizations can further improve the power savings