RELOCATE: Register File Local Access Pattern Redistribution Mechanism for Power and Thermal Management in Out-of-Order Embedded Processor

Presentation transcript:

1 RELOCATE: Register File Local Access Pattern Redistribution Mechanism for Power and Thermal Management in Out-of-Order Embedded Processor
Houman Homayoun, Aseem Gupta, Avesta Sasan, Alex Veidenbaum, Nikil Dutt, Fadi Kurdahi
University of California, Irvine

2 Outline
Motivation
Background study
Study of register file underutilization
Study of register file default access patterns
Access concentration and activity redistribution to relocate register file access patterns
Results

3 Why Register File?
The RF is one of the hottest units in a processor
– A small, heavily multi-ported SRAM
– Accessed very frequently
Example: IBM PowerPC 750FX

4 Why Temperature?
Higher power densities (W/mm²) lead to higher operating temperatures, which:
(i) increase the probability of timing violations
(ii) reduce IC lifetime
(iii) lower operating frequency
(iv) increase leakage power
(v) require expensive cooling mechanisms
(vi) increase overall design effort and cost

5 Prior Work: Activity Migration
Reduces temperature by migrating activity to a replicated unit.
– Requires a replicated unit, which incurs a large area overhead
– Leads to a large performance degradation
[Figure panels: AM, AM+PG]

6 Conventional Register Renaming
[Figure: register renamer; register allocation and release]
Physical registers are allocated and released in a somewhat random order

7 Analysis of Register File Operation
1. Register File Occupancy
[Figure panels: MiBench, SPECint2K]

8 Performance Degradation with a Smaller Register File
[Figure panels: MiBench, SPECint2K]

9 Analysis of Register File Operation
2. Register File Access Distribution
– The coefficient of variation (CV) shows how far the number of accesses to individual physical registers deviates from the average.
– na_i is the number of accesses to physical register i during a specific period (10K cycles), na is the average number of accesses per register, and N is the total number of physical registers.
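
The formula on this slide did not survive in the transcript. As a minimal sketch, assuming the standard definition of the coefficient of variation (population standard deviation of the per-register access counts divided by their mean), the metric could be computed as follows. The function name and the choice of dividing by N rather than N - 1 are assumptions, not taken from the slides.

```python
import math


def coefficient_of_variation(access_counts):
    """CV of per-register access counts over one sampling period.

    access_counts[i] plays the role of na_i; len(access_counts) is N.
    Assumes the standard definition: population std. deviation / mean.
    """
    n = len(access_counts)
    mean = sum(access_counts) / n  # "na", the average number of accesses
    if mean == 0:
        return 0.0  # no accesses at all: treat as perfectly uniform
    variance = sum((na_i - mean) ** 2 for na_i in access_counts) / n
    return math.sqrt(variance) / mean


# Uniform accesses give a CV of 0; concentrated accesses give a larger CV.
uniform = coefficient_of_variation([5, 5, 5, 5])
concentrated = coefficient_of_variation([0, 0, 0, 20])
```

A low CV corresponds to accesses spread evenly over the physical registers, and a high CV to accesses concentrated in a few of them.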

10 Coefficient of Variation
[Figure panels: MiBench, SPEC2K]

11 Register File Operation
Underutilization that is distributed uniformly: while only a small number of registers are occupied at any given time, the total accesses are uniformly distributed over the entire physical register file during the course of execution.

12 RELOCATE: Access Redistribution within a Register File
The goal is to “concentrate” accesses within a partition (region) of the RF
– Some regions will be idle (for 10K cycles)
– Idle regions can be power-gated and allowed to cool down
[Figure: register activity with (a) baseline, (b) in-order, (c) distant patterns]

13 An Architectural Mechanism to Support Access Redistribution
Active partition: a register renamer partition currently used in register renaming
Idle partition: a register renamer partition which does not participate in renaming
Active region: a region of the register file corresponding to a register renamer partition (whether active or idle) which has live registers
Idle region: a region of the register file corresponding to a register renamer partition (whether active or idle) which has no live registers

14 Activity Migration without Replication
An access concentration mechanism allocates registers from only one partition.
This default active partition (DAP) may run out of free registers before the 10K-cycle “convergence period” is over
– Another partition (chosen according to some algorithm) is then activated; such partitions are referred to as additional active partitions (AAP)
– To facilitate physical register concentration in the DAP, if two or more partitions are active and have free registers, allocation is performed in the same order in which the partitions were activated.
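
The allocation rule above can be sketched as a toy model. This is only an illustration under stated assumptions: the class and method names are hypothetical, and because the slides leave the choice of the next partition to "some algorithm", the sketch simply takes the next inactive partition in round-robin order.

```python
class ConcentratingRenamer:
    """Toy model of access concentration across renamer partitions."""

    def __init__(self, num_partitions, regs_per_partition):
        self.regs_per_partition = regs_per_partition
        # free[p] lists the free physical registers of partition p.
        self.free = [
            list(range(p * regs_per_partition, (p + 1) * regs_per_partition))
            for p in range(num_partitions)
        ]
        # Partitions in activation order; active[0] is the DAP,
        # any later entries are AAPs.
        self.active = [0]

    def _next_partition(self):
        # Assumed policy (not from the slides): round-robin search for an
        # inactive partition that still has free registers.
        n = len(self.free)
        start = self.active[-1]
        for step in range(1, n + 1):
            cand = (start + step) % n
            if cand not in self.active and self.free[cand]:
                return cand
        return None

    def allocate(self):
        # Prefer active partitions in the order they were activated,
        # so allocation drifts back to the DAP whenever it has room.
        for p in self.active:
            if self.free[p]:
                return self.free[p].pop(0)
        extra = self._next_partition()
        if extra is None:
            return None  # no free physical register: renaming must stall
        self.active.append(extra)
        return self.free[extra].pop(0)

    def release(self, reg):
        # Return a physical register to the free list of its partition.
        self.free[reg // self.regs_per_partition].append(reg)
```

With four partitions of two registers each, the first two allocations come from partition 0, the third activates partition 1, and a register released in partition 0 is reused before partition 1 is touched again.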

15 The Access Concentration Mechanism Partition activation order is

16 The Redistribution Mechanism
The default active partition is changed once every N cycles to redistribute the activity within the register file (according to some algorithm)
– Once a new default partition (NDP) is selected, all active partitions (DAP+AAP) become idle.
The idle partitions do not participate in register renaming, but their corresponding RF regions may have to be kept active (powered up)
– A physical register in an idle partition may be live
An idle RF region is power gated when its active list becomes empty.
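
A minimal sketch of the redistribution step follows, assuming that "active list becomes empty" means the region holds no live registers, and that the new default partition is picked round-robin (the slides only say "according to some algorithm"). All names are hypothetical.

```python
class RelocatingRegisterFile:
    """Toy model of redistribution plus power gating of idle regions."""

    def __init__(self, num_partitions, regs_per_partition):
        self.n = num_partitions
        self.size = regs_per_partition
        self.live = [set() for _ in range(num_partitions)]  # live regs per region
        self.active = [0]  # partitions used for renaming; active[0] is the DAP
        self.gated = [p != 0 for p in range(num_partitions)]  # power-gated regions

    def allocate(self):
        # Active partitions first (in activation order), then the others.
        others = [p for p in range(self.n) if p not in self.active]
        for p in self.active + others:
            if len(self.live[p]) < self.size:
                if p not in self.active:
                    self.active.append(p)
                self.gated[p] = False  # region must be powered before access
                base = p * self.size
                reg = next(r for r in range(base, base + self.size)
                           if r not in self.live[p])
                self.live[p].add(reg)
                return reg
        return None  # register file full

    def _gate_if_idle(self, p):
        # Gate only regions that are out of renaming and hold no live register.
        if p not in self.active and not self.live[p]:
            self.gated[p] = True

    def release(self, reg):
        p = reg // self.size
        self.live[p].discard(reg)
        self._gate_if_idle(p)

    def redistribute(self):
        # Called once every N cycles. Assumed policy: round-robin NDP.
        ndp = (self.active[0] + 1) % self.n
        old_active = self.active
        self.active = [ndp]      # DAP and AAPs all become idle partitions
        self.gated[ndp] = False
        for p in old_active:
            self._gate_if_idle(p)  # stays powered while live registers remain
        return ndp
```

In this model a former DAP stays powered after redistribution until its last live register is released, which mirrors the distinction between idle partitions and idle regions.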

17 The redistribution mechanism

18 Performance Impact?
There is a two-cycle delay to wake up a power-gated physical register region.
Register renaming occurs in the front end of the microprocessor pipeline, whereas register access occurs in the back end.
– There is a delay of at least two pipeline stages between renaming a physical register and accessing it
– The requested region can therefore be woken up in time
A required register file region can be woken up without incurring a performance penalty at the time of access.
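
The timing argument can be checked with a small calculation. This is a sketch under two assumptions not stated on the slide: each pipeline stage takes one cycle, and the wake-up is triggered in the same cycle the register is renamed. The function name is hypothetical.

```python
def exposed_wakeup_stall(wakeup_cycles, rename_to_access_stages):
    """Stall cycles visible at register access time.

    Assumes one cycle per pipeline stage and that the wake-up of the
    target region starts when the register is renamed.
    """
    return max(0, wakeup_cycles - rename_to_access_stages)


# Two-cycle wake-up hidden behind two stages between rename and access.
hidden = exposed_wakeup_stall(wakeup_cycles=2, rename_to_access_stages=2)
```

With the two-cycle wake-up and a gap of at least two stages quoted above, the exposed stall is zero; a shorter gap would leave part of the wake-up latency visible.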

19 Experimental Setup
MASE (SimpleScalar 4.0)
– Models a MIPS-74K processor at 800 MHz
MiBench and SPECint2K benchmarks compiled with the Compaq compiler, -O4 flag
Industrial memory compiler used
– 64-entry, 64-bit single-ended SRAM memory in TSMC 45nm technology
HotSpot used to estimate thermal profiles

20

21 Results: MiBench RF power reduction

22 SPEC2K RF power reduction

23 Analysis of Power Reduction
Increasing the number of RF partitions provides more opportunity to capture and cluster unmapped registers into a partition
– Indicates that the wakeup overhead is amortized for a larger number of partitions.
Some exceptions
– The overall power overhead associated with waking up an idle region becomes larger as the number of partitions increases.
– Power gating becomes frequent but ineffective, and its overhead grows, as the number of partitions increases

24 Peak Temperature Reduction

25 Analysis of Temperature Reduction
Increasing the number of partitions results in a larger power density in each partition, because RF access activity is concentrated in a smaller partition
– While capturing more idle partitions and power gating them may potentially result in a higher power reduction, the larger power density due to the smaller partition size results in an overall higher temperature
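
A back-of-the-envelope relation may help show why smaller partitions raise power density. It rests on idealized assumptions that are not in the slides: k equal-sized partitions, all dynamic access power P landing in a single active partition, and no leakage or lateral heat spreading. With total RF area A:

```latex
\[
  \rho_{\text{baseline}} = \frac{P}{A},
  \qquad
  \rho_{\text{active}}(k) = \frac{P}{A/k} = k \cdot \frac{P}{A}
\]
```

Under these assumptions the power density of the active partition grows linearly with k, while the power saved by gating the other partitions does not reduce the density where the accesses land. This is only a rough intuition for the trade-off described above, not a thermal model.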

26 Conclusions
Showed register file underutilization
Studied register file default access patterns
Proposed access concentration and activity redistribution to relocate register file accesses
Results show a noticeable power and temperature reduction in the RF
The RELOCATE technique can be applied when units are underutilized
– As opposed to activity migration, which requires replication