Bypass Aware Instruction Scheduling for Register File Power Reduction Sanghyun Park 2 Aviral Shrivastava 1 Nikil Dutt 1 Alex Nicolau 1 Yunheung Paek 2.

Presentation transcript:

Bypass Aware Instruction Scheduling for Register File Power Reduction
Sanghyun Park (2), Aviral Shrivastava (1), Nikil Dutt (1), Alex Nicolau (1), Yunheung Paek (2), Eugene Earlie (3)
(1) CECS, ICS, UC Irvine, CA, USA
(2) SEE, SNU, Seoul, Korea
(3) SCL, Intel, Hudson, MA, USA

Copyright © 2006 UCI ACES Laboratory

Slide 2: Processor Power
- Power is now a primary architectural concern
  - E.g., processor power consumption doubles with each Pentium generation
- High power consumption
  - Increases packaging/cooling cost
  - Limits achievable performance
- Especially important for handheld embedded devices
  - Battery life
  - Weight
Figure sources: "Managing the Impact of Increasing…", Gunther, Binns et al., Intel Technology Journal (cost of removing heat from a microprocessor); Intel website (increasing power consumption)

Slide 3: Power Density
- Power density = power per unit area
- Silicon is not a good conductor of heat
  - Areas with high power density become hot
  - Higher temperature increases leakage
- Positive feedback loop, possibly leading to thermal runaway
- Important to distribute power over the die
  - Must “attack” hot-spots [Fred Pollack, Intel Corp, MICRO 32 keynote]
  - Heat Stroke: execution has to stop if any part of the die exceeds the critical temperature
- Research beginning to address power density
  - Temperature-aware floorplanning: surround high power density components with low power density components
  - Migrate tasks across cores: distribute heat-intensive tasks across the die
  - Many other efforts…

Slide 4: Register File Power
- Register File (RF) is a significant source of power dissipation
  - Motorola M.CORE: approx. 16% of processor power
  - RF may consume up to 25% of processor power
- High Register File power density
  - Small size, causes hotspots
  - e.g., Alpha 21264, Intel Pentium
- Trend: increasing RF power due to
  - Microarchitectural enhancements to improve IPC
  - Compiler techniques to improve IPC
  - Large Register Files (esp. VLIW processors)

Slide 5: Heat Stroke from RF accesses
Example:
    Label1: add $1, $2, $3
            br Label1
- Repeated access to the register file at a high rate
- Creates repeated hot spots at the register file
- Heat-up time short (1.2 ms), cooling time long (12 ms)
- Degrades CPU utilization to 10%
Slide from “Heat Stroke: Power-Density-Based Denial of Service in SMT”, Jahangir Hasan et al., ISHPC 2005
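The 10% figure can be roughly reproduced from the two time constants. The sketch below assumes (this is not stated on the slide) that the core does useful work only while heating up and is stalled for the whole cooling period; the function name is hypothetical.

```python
def heat_stroke_utilization(heat_up_ms: float, cool_down_ms: float) -> float:
    """Fraction of time spent executing, assuming the core runs only during
    heat-up and is fully stalled while cooling (an illustrative assumption)."""
    return heat_up_ms / (heat_up_ms + cool_down_ms)


# Values quoted on the slide: 1.2 ms heat-up, 12 ms cooling.
utilization = heat_stroke_utilization(1.2, 12.0)
print(f"{utilization:.1%}")  # about 9%, consistent with the quoted ~10%
```

Under that assumption the duty cycle is 1.2 / 13.2, which is where a utilization of roughly one tenth comes from.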

Slide 6: Outline
- Previous work in reducing RF Power
- On-Demand RF Read
- Instruction Scheduling technique for RF Power reduction
- Experiments
- Summary

Slide 7: Reducing RF Power: Related Work
- Evaluation/estimation of RF power and RF power density: [ISLPED 98], [TCAD 01], [DATE 02]
- Three ways to reduce RF power
  1. Reduce energy per access to RF
  2. Reduce # registers in RF
  3. Reduce # accesses to RF
1. Reduce energy per access to RF
  - Register File Design Considerations…, Farkas, Jouppi, Chow, WRL Research Report, 1995
  - The Energy Complexity of Register Files, Zyuban, Kogge, ISLPED 1998
  - Energy Efficient Register Access, Tseng, Asanovic, SBCCI 2000

Slide 8: Reducing RF Power: Related Work
2. Reduce # registers in RF
  - Instruction scheduling to minimize # overlapping live ranges
    - Power-Aware Modulo Scheduling, Yun, Kim, ISLPED 2001
    - Lifetime-Sensitive Modulo Scheduling, Huff, PLDI 1993
    - Stage Scheduling…, Eichenberger, Davidson, MICRO
3. Reduce # accesses to RF
  - Hierarchical Register File
    - Reducing the Complexity of RF…, Balasubramonian et al., MICRO 2001
  - Most lifetimes are short, so temporarily hold the register value in a buffer
    - Reducing Register File Power…, Hu, Martonosi, Workshop on Complexity-Effective Design 2000
    - Reducing Register Ports using…, Kim, Mudge, ICS 2003

Slide 10: “On-Demand” RF Read
- Existing processors anticipatorily read the RF
  - e.g., Pentium 4, Alpha
- SpecInt95 running on MIPS II: 36% of operands come from bypasses
- 8-issue SimpleScalar running SpecInt2K: 50-70% of operands come from bypasses
- Read from RF only if necessary (Tseng & Asanovic, SBCCI 2000)
  - First find out if the value is present in the bypasses
  - If not, then read the value from RF
  - We'll call this “On-Demand RF Read”
- When applied to the Intel XScale model
  - 58% energy reduction
  - < 3% performance loss
- This paper: further reduction in RF power by instruction scheduling
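The "check the bypasses first, read the RF only on a miss" policy can be sketched as follows. This is a toy software model, not the hardware design from the cited work; the `Datapath` class, its fields, and the register names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Datapath:
    """Toy model: a register file plus the values currently on the bypasses."""
    regfile: dict
    bypass: dict = field(default_factory=dict)  # register -> value on a bypass
    rf_reads: int = 0                           # number of RF accesses performed

    def read_operand(self, reg: str) -> int:
        # On-demand read: use the bypassed value if one is present ...
        if reg in self.bypass:
            return self.bypass[reg]
        # ... and access the register file only when no bypass holds it.
        self.rf_reads += 1
        return self.regfile[reg]


dp = Datapath(regfile={"R1": 0, "R5": 5}, bypass={"R1": 7})
operands = [dp.read_operand("R5"), dp.read_operand("R1")]  # SUB R4 R5 R1
print(operands, dp.rf_reads)
```

Here only R5 costs an RF access; R1 is taken from the bypass, which is the saving an anticipatory read would forgo.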

Slide 12: Processor Model
Figure: partially bypassed processor pipeline with stages F, D, OR, X1, X2, WB and the register file (RF)
- Pipeline bypasses improve performance
- Full bypassing: best performance, but high power and wiring complexity
- Partial bypassing: keep only some bypasses
  - Popular in embedded processors, e.g., Intel XScale

Slide 13: Operation Execution Model
- On-Demand RF Read
- Read source operands → bypass result → write back
Figure: ADD R1 R2 R3 moving through the pipeline (F, D, OR, X1, X2, WB, with RF)
  - OR: read R2, R3 from RF and bypasses
  - X1: bypass R1 to second port of OR
  - X2: do nothing
  - WB: write back R1 to RF

Slide 14: How can scheduling help?
Figure: two instruction orders on the pipeline (F, D, OR, X1, X2, WB, with RF)
First order:
    ADD R1 R2 R3
    ADD R10 R11 R12
    SUB R4 R5 R1
  SUB CANNOT use bypass to read R1
Second order:
    ADD R1 R2 R3
    SUB R4 R5 R1
    ADD R10 R11 R12
  SUB CAN use bypass to read R1
Instruction scheduling can reduce RF usage!
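The effect of the two orders can be checked with a small counter. The model is an assumption made for illustration: one instruction issues per cycle, and a result can be picked up from a bypass only by an instruction issued within `window` cycles of its producer (`window=1` matches the slide's example). The function and tuple layout are hypothetical.

```python
def rf_reads(schedule, window=1):
    """Count source operands that must come from the register file.

    schedule: list of (dest, [sources]) tuples, one instruction per cycle.
    An operand is bypassed only if its producer issued at most `window`
    cycles earlier (an illustrative assumption, not the XScale pipeline).
    """
    last_def = {}  # register -> cycle of its most recent producer
    reads = 0
    for cycle, (dest, sources) in enumerate(schedule):
        for src in sources:
            produced = last_def.get(src)
            if produced is None or cycle - produced > window:
                reads += 1
        last_def[dest] = cycle
    return reads


add1 = ("R1", ["R2", "R3"])
add2 = ("R10", ["R11", "R12"])
sub = ("R4", ["R5", "R1"])

first_order = rf_reads([add1, add2, sub])   # SUB reads R1 from the RF
second_order = rf_reads([add1, sub, add2])  # SUB gets R1 from the bypass
print(first_order, second_order)  # 6 5
```

Moving SUB directly behind its producer removes one RF read without changing the computed values.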

Slide 15: Bypass-sensitive RF Power-Aware Scheduling
- Schedule instructions so that
  - Dependent instructions transfer operands using bypasses
  - RF usage is reduced
- Compiler needs to know
  - When does an instruction bypass its result?
  - Which operands can read the result?
  - When is the result written into the register file?
- Compiler needs a detailed processor-operation model
Figure: the instruction orders "ADD R1 R2 R3; ADD R10 R11 R12; SUB R4 R5 R1" and "ADD R1 R2 R3; SUB R4 R5 R1; ADD R10 R11 R12" on the pipeline (F, D, OR, X1, X2, WB, with RF)

Slide 16: Operation Table (OT)
- Model all the resources and registers used by an operation in each cycle of its execution
- Can determine which operands are available for each source operand
- Use OTs for scheduling to reduce the usage of RF

Operation Table for ADD R1 R2 R3 (C1 to C4 are connections in the pipeline figure: F, D, OR, X1, X2, WB, with RF)
  Cycle  Resource  Operands
  1      F
  2      D
  3      OR        ReadOperands: R2 via C1 from RF; R3 via C2 from RF or via C4 from X1
                   DestOperands: R1 (RF)
  4      X1        WriteOperands: R1 via C4 to OR
  5      X2
  6      WB        WriteOperands: R1 via C3 to RF

Reference: Operation Tables for Scheduling in Partially Bypassed Processors, Shrivastava, Earlie, Dutt, Nicolau, CODES+ISSS 2004
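One way to hold such a table in a compiler is a list with one entry per cycle. The sketch below encodes the ADD R1 R2 R3 table from this slide; the class, field, and function names are hypothetical and do not come from the cited paper.

```python
from dataclasses import dataclass, field


@dataclass
class OTCycle:
    """One cycle of an operation table (illustrative layout)."""
    resource: str
    # source register -> list of (connection, where the value comes from)
    read_operands: dict = field(default_factory=dict)
    # result register -> (connection, where the value is sent)
    write_operands: dict = field(default_factory=dict)


add_ot = [
    OTCycle("F"),
    OTCycle("D"),
    OTCycle("OR", read_operands={"R2": [("C1", "RF")],
                                 "R3": [("C2", "RF"), ("C4", "X1")]}),
    OTCycle("X1", write_operands={"R1": ("C4", "OR")}),
    OTCycle("X2"),
    OTCycle("WB", write_operands={"R1": ("C3", "RF")}),
]


def bypass_capable_operands(ot):
    """Source operands that have at least one non-RF (bypass) path."""
    return sorted(reg for cyc in ot for reg, paths in cyc.read_operands.items()
                  if any(source != "RF" for _, source in paths))


def rf_write_cycle(ot, reg):
    """1-based cycle in which `reg` is written to the RF, or None."""
    for number, cyc in enumerate(ot, start=1):
        if cyc.write_operands.get(reg, (None, None))[1] == "RF":
            return number
    return None


print(bypass_capable_operands(add_ot), rf_write_cycle(add_ot, "R1"))
```

With this layout a scheduler can ask which source ports can take a bypassed value and when a result reaches the RF, the kind of questions listed on the previous slide.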

Slide 17: OT-based RF Power-Aware Scheduling
- Operation Tables (OTs) provide a mechanism
  - To accurately estimate the number of operands read from RF
- Exploit OTs for scheduling to reduce RF usage
  - Various scheduling strategies can be employed
  - Choose the scheduling heuristic with the least RF usage
- We evaluated 3 basic block (BB) scheduling techniques
  1. RFPEX: exhaustive
  2. RFPN: greedy, O(n)
  3. RFPN2: greedy with one level of backtracking, O(n^2)

Slide 19: Experimental Setup
- Intel XScale
  - 7-stage, partially bypassed
  - On-Demand RF Read architecture
- RF power model
  - RF power = # register file accesses
- MiBench benchmarks
- Scheduler
  - Operation Table-based RF power-aware scheduling
  - Within basic block
  - Tried 3 strategies
- RF power results
  - Compared with the On-Demand RF Read architecture as baseline
Figure: tool flow with the blocks Application, GCC -O3, Assembly, OT-based Scheduler, GCC linker, Executable, and Cycle-Accurate Simulator, producing Runtime and RF Reads

Slide 20: RFPEX Scheduling
- Exhaustive
  - Try all legal permutations of instructions
- O(n!) complexity
  - n = # instructions in BB
- Compilation time: hours
  - Could not schedule susan, rijndael (2 days)
- RF power reduction: average 12%
- Performance improvement: average 1.4%
Chart callouts: 26% reduction, 7% improvement
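The exhaustive strategy can be sketched as "enumerate every dependence-respecting permutation and keep the one with the fewest RF reads". The cost model is an assumption (one instruction per cycle, a result is bypassable for `window` cycles), and `depends`, `rf_reads`, and `exhaustive_schedule` are hypothetical names, not the authors' implementation.

```python
from itertools import permutations


def depends(later, earlier):
    """True if `later` must stay after `earlier` (RAW, WAW, or WAR)."""
    (l_dest, l_srcs), (e_dest, e_srcs) = later, earlier
    return e_dest in l_srcs or e_dest == l_dest or l_dest in e_srcs


def rf_reads(schedule, window=1):
    """RF reads under the assumed one-instruction-per-cycle bypass model."""
    last_def, reads = {}, 0
    for cycle, (dest, srcs) in enumerate(schedule):
        reads += sum(1 for s in srcs
                     if s not in last_def or cycle - last_def[s] > window)
        last_def[dest] = cycle
    return reads


def is_legal(perm, block):
    """A permutation is legal if every dependence keeps its original order."""
    pos = {idx: p for p, idx in enumerate(perm)}
    return all(pos[j] < pos[i]
               for i in range(len(block)) for j in range(i)
               if depends(block[i], block[j]))


def exhaustive_schedule(block, window=1):
    """Try all legal orders of the basic block: O(n!) in the worst case."""
    legal = (p for p in permutations(range(len(block))) if is_legal(p, block))
    best = min(legal, key=lambda p: rf_reads([block[i] for i in p], window))
    return [block[i] for i in best]


block = [("R1", ["R2", "R3"]), ("R10", ["R11", "R12"]), ("R4", ["R5", "R1"])]
best = exhaustive_schedule(block)
print(rf_reads(block), rf_reads(best))  # 6 5
```

The factorial growth of the search space is why this approach is reported to take hours and to fail on the largest blocks.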

Slide 21: RFPN Scheduling
- Greedy O(n) scheduling
  - Pick instructions one by one
  - Pick the instruction which gets the most operands from bypasses
- O(n) complexity
  - n = # instructions in BB
- Compilation time: seconds
- RF power reduction: average 6%
- Performance improvement: average -3.5%
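A greedy pick of this kind might look like the sketch below: among the instructions whose dependences are already satisfied, choose the one that would receive the most operands from bypasses, breaking ties by original order. The bypass model (`window`), the tie-break, and all names are assumptions, and this simple version rescans the ready set each step, so it does not try to match the O(n) bound quoted on the slide.

```python
def depends(later, earlier):
    """True if `later` must stay after `earlier` (RAW, WAW, or WAR)."""
    (l_dest, l_srcs), (e_dest, e_srcs) = later, earlier
    return e_dest in l_srcs or e_dest == l_dest or l_dest in e_srcs


def bypassed(instr, cycle, last_def, window=1):
    """Operands of `instr` that would arrive via a bypass at `cycle`."""
    return sum(1 for s in instr[1]
               if s in last_def and cycle - last_def[s] <= window)


def greedy_schedule(block, window=1):
    remaining = list(range(len(block)))
    order, last_def = [], {}
    while remaining:
        cycle = len(order)
        # Ready = no unscheduled earlier instruction that it depends on.
        ready = [i for i in remaining
                 if not any(depends(block[i], block[j])
                            for j in remaining if j < i)]
        # Most bypassed operands first; ties go to the original order.
        best = max(ready, key=lambda i: (bypassed(block[i], cycle,
                                                  last_def, window), -i))
        order.append(best)
        last_def[block[best][0]] = cycle
        remaining.remove(best)
    return [block[i] for i in order]


block = [("R1", ["R2", "R3"]), ("R10", ["R11", "R12"]), ("R4", ["R5", "R1"])]
scheduled = greedy_schedule(block)
print(scheduled)
```

On the three-instruction example the greedy choice moves SUB directly behind the ADD that produces R1.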

Slide 22: RFPN2 Scheduling
- RFPN2: greedy with one level of backtracking
- O(n^2) complexity
  - n = # instructions in BB
- Compilation time: minutes
- RF power reduction: average 10%
- Performance improvement: average -2%
- RFPN2 works well!
Chart callout: average 10% reduction
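The slide does not spell out how the single level of backtracking works. The sketch below shows one possible reading, a one-step lookahead: each ready candidate is scored by its own bypassed operands plus the best bypass count achievable by the instruction that could follow it. This is an interpretation for illustration only; the bypass model and all names are hypothetical.

```python
def depends(later, earlier):
    """True if `later` must stay after `earlier` (RAW, WAW, or WAR)."""
    (l_dest, l_srcs), (e_dest, e_srcs) = later, earlier
    return e_dest in l_srcs or e_dest == l_dest or l_dest in e_srcs


def bypassed(instr, cycle, last_def, window=1):
    """Operands of `instr` that would arrive via a bypass at `cycle`."""
    return sum(1 for s in instr[1]
               if s in last_def and cycle - last_def[s] <= window)


def ready_set(block, remaining):
    """Instructions with no unscheduled earlier instruction they depend on."""
    return [i for i in remaining
            if not any(depends(block[i], block[j]) for j in remaining if j < i)]


def lookahead_schedule(block, window=1):
    remaining = list(range(len(block)))
    order, last_def = [], {}
    while remaining:
        cycle = len(order)

        def score(i):
            # Gain now, plus the best gain available one step later.
            now = bypassed(block[i], cycle, last_def, window)
            rest = [j for j in remaining if j != i]
            trial = {**last_def, block[i][0]: cycle}
            nxt = max((bypassed(block[j], cycle + 1, trial, window)
                       for j in ready_set(block, rest)), default=0)
            return (now + nxt, -i)

        best = max(ready_set(block, remaining), key=score)
        order.append(best)
        last_def[block[best][0]] = cycle
        remaining.remove(best)
    return [block[i] for i in order]


block = [("R1", ["R2", "R3"]), ("R10", ["R11", "R12"]), ("R4", ["R5", "R1"])]
scheduled = lookahead_schedule(block)
print(scheduled)
```

Scoring every ready candidate against every possible successor costs roughly a factor of n more work than the plain greedy pick, which is in line with the higher complexity class quoted for RFPN2.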

Slide 24: Summary
- Register File is one of the main hotspots in processors
- Very important to reduce RF power
  - Repeated accesses cause “Heat Stroke”
  - Up to 90% performance degradation
- On-Demand RF Read is an effective technique
  - 58% RF power reduction
- Scope for further RF power reduction via instruction scheduling
- Contribution: instruction scheduling technique for further RF power reduction
  - Up to 26%, average 12% RF power reduction
  - 2% performance degradation
  - Over and above the On-Demand RF Read architecture as baseline
  - RFPN2 is an effective heuristic for RF power reduction
- Future work
  - Beyond basic block scheduling