Copyright © 2010 Houman Homayoun. Houman Homayoun, National Science Foundation Computing Innovation Fellow, Department of Computer Science, University of California.

Presentation transcript:

FFT-Cache: A Flexible Fault-Tolerant Cache Architecture for Ultra Low Voltage Operation
Houman Homayoun, National Science Foundation Computing Innovation Fellow, Department of Computer Science, University of California San Diego

Motivation
• The failure rate of an SRAM cell increases exponentially as Vdd is lowered.
• At near-threshold voltages, almost all cache sets and blocks become faulty.
• At high bit failure rates, a large number of blocks conflict with one another.
• An efficient fault-tolerant method is needed that can tolerate faulty blocks at such high fault rates.
Example configuration: a 64KB 4-way set-associative L1 cache with 64B block size and 8b subblock size.
CASES 2011
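To give a feel for why blocks become faulty so quickly, here is a minimal sketch that assumes bit failures are independent and identically distributed (an assumption made for illustration, not stated on the slide). Under that assumption a unit of n bits is faulty with probability 1 - (1 - p)^n. The block and subblock sizes come from the slide; the bit failure rates in the loop are illustrative values, not measured data.

```python
def faulty_prob(p_bit: float, n_bits: int) -> float:
    """Probability that a unit of n_bits holds at least one faulty bit,
    assuming independent bit failures with probability p_bit."""
    return 1.0 - (1.0 - p_bit) ** n_bits


BLOCK_BITS = 64 * 8   # 64B block (from the slide)
SUBBLOCK_BITS = 8     # 8b subblock (from the slide)
WAYS = 4              # 4-way set-associative (from the slide)


def all_ways_faulty_prob(p_bit: float) -> float:
    """Probability that every block in a 4-way set is faulty
    (same independence assumption)."""
    return faulty_prob(p_bit, BLOCK_BITS) ** WAYS


# Illustrative bit failure rates only.
for p_bit in (1e-5, 1e-4, 1e-3, 1e-2):
    print(
        f"p_bit={p_bit:.0e}  "
        f"subblock={faulty_prob(p_bit, SUBBLOCK_BITS):.4f}  "
        f"block={faulty_prob(p_bit, BLOCK_BITS):.4f}  "
        f"whole set={all_ways_faulty_prob(p_bit):.4f}"
    )
```

The gap between the subblock and block columns is the reason for tracking faults at subblock granularity: a block with a faulty bit usually still has many usable subblocks.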

Related Work: Fault-Tolerant Caches
• Circuit-level techniques: 8T SRAM, 10T SRAM, ST SRAM, ...
• Error detection/correction code methods: SECDED, DECTED, ...
• Architecture-level techniques:
  - Cache-resizing methods
    - Yield-Aware Cache
    - Wilkerson et al. (Word-disable and Bit-fix)
These techniques are not efficient at high fault rates.

Our Goal
• Design a very low power, fault-tolerant cache architecture that can detect and replicate memory faults arising from operation in the near-threshold region (< 650 mV).
• Use a portion of the faulty cache blocks (global blocks) as redundancy to tolerate other faulty blocks or lines.
• Categorize the cache lines by the degree of conflict among their blocks, to reduce the granularity of redundancy replacement.
• Use a flexible defect map, with a simple and efficient algorithm to initialize and update it, to minimize the non-functional cache area.

Base Architecture
• Each block is divided into multiple equally sized subblocks.
• Each subblock is labeled faulty if it has at least one faulty bit.
• Each block is labeled faulty if it has at least one faulty subblock.
• Two blocks (lines) have a conflict if they have at least one faulty subblock (block) in the same position.
• Maximum Global Block (MGB): threshold for determining Min_faulty and Low_conflict lines.
[Figure: two banks (Bank 1, Bank 2), each with four ways (Way 1 to Way 4); each line (set) holds blocks with 4 subblocks. Legend: block-level conflict, line-level conflict, faulty block, Min_faulty line, No_conflict line (within blocks in line), Low_conflict line, High_conflict line.]
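The definitions above map naturally onto bitmasks. The following is a minimal sketch, not the authors' implementation: it represents each block by a mask with one bit per subblock and each line by a mask with one bit per way. All function names are hypothetical.

```python
from typing import List


def subblock_fault_mask(bit_faults: List[bool], subblock_bits: int) -> int:
    """Bit i of the result is set if subblock i has at least one faulty bit."""
    mask = 0
    for start in range(0, len(bit_faults), subblock_bits):
        if any(bit_faults[start:start + subblock_bits]):
            mask |= 1 << (start // subblock_bits)
    return mask


def block_is_faulty(block_mask: int) -> bool:
    """A block is faulty if it has at least one faulty subblock."""
    return block_mask != 0


def blocks_conflict(mask_a: int, mask_b: int) -> bool:
    """Two blocks conflict if they share a faulty subblock position."""
    return (mask_a & mask_b) != 0


def line_fault_mask(block_masks: List[int]) -> int:
    """Bit w of the result is set if the block in way w is faulty."""
    mask = 0
    for way, block_mask in enumerate(block_masks):
        if block_is_faulty(block_mask):
            mask |= 1 << way
    return mask


def lines_conflict(line_a: List[int], line_b: List[int]) -> bool:
    """Two lines conflict if they have a faulty block in the same way."""
    return (line_fault_mask(line_a) & line_fault_mask(line_b)) != 0
```

For example, with 4 subblocks per block, the masks `0b0101` and `0b1010` do not conflict, so one block can cover the other's faulty subblocks, while `0b0101` and `0b0110` conflict at subblock 2.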

FFT-Cache Configuration
• FDM initialization
  - Run memory BIST to characterize memory faults in low-voltage mode.
  - Fill the defect map entries based on the BIST output.
• FDM configuration algorithm
  - Categorize the FDM entries by degree of conflict: Min_faulty, No_conflict, Low_conflict, High_conflict.
  - For Min_faulty lines, set the faulty blocks as Global Target blocks.
  - For No_conflict lines, set one of the faulty blocks as the Local Target block.
  - For Low_conflict lines, try to find a Global Target block in the other bank.
  - For High_conflict lines, try to find a Global Target line in the other bank.
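The categorization step can be sketched as follows. The slides name the four categories and say that MGB is the threshold for Min_faulty and Low_conflict lines, but they do not give the exact rules, so the thresholds here are assumptions: a line is treated as Min_faulty when it has at most `mgb` faulty blocks, and as Low_conflict when it has at most `mgb` conflicting block pairs. The function name and the `ACTIONS` table are hypothetical; the action strings paraphrase the slide.

```python
from itertools import combinations
from typing import List

# Action per category, paraphrased from the slide.
ACTIONS = {
    "Min_faulty": "set faulty blocks as Global Target blocks",
    "No_conflict": "set one faulty block as the Local Target block",
    "Low_conflict": "find a Global Target block in the other bank",
    "High_conflict": "find a Global Target line in the other bank",
}


def categorize_line(block_masks: List[int], mgb: int) -> str:
    """Classify one cache line from its per-block subblock fault masks.

    ASSUMPTION: the comparisons against mgb are illustrative; the slides
    only state that MGB is the threshold for Min_faulty and Low_conflict.
    """
    faulty = [m for m in block_masks if m != 0]
    if len(faulty) <= mgb:
        return "Min_faulty"
    # Count pairs of faulty blocks sharing a faulty subblock position.
    conflicts = sum(1 for a, b in combinations(faulty, 2) if a & b)
    if conflicts == 0:
        return "No_conflict"
    if conflicts <= mgb:
        return "Low_conflict"
    return "High_conflict"


# Four ways, four subblocks per block, MGB = 1 (illustrative values).
for line in ([0b0001, 0, 0, 0],
             [0b0001, 0b0010, 0, 0],
             [0b0001, 0b0001, 0b0010, 0],
             [0b0011, 0b0001, 0b0010, 0b0011]):
    category = categorize_line(line, mgb=1)
    print(category, "->", ACTIONS[category])
```

Under these assumed rules a fault-free line also falls into Min_faulty, since it has zero faulty blocks; a real defect map would likely handle that case separately.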

Proposed FFT-Cache
• Three types of fault replication:
  - Local Target Block
  - Global Target Block
  - Global Target Line
[Figure: two banks (Bank 1, Bank 2), each with four ways (Way 1 to Way 4). Legend: lines with no conflict between their blocks, lines with low conflict between their blocks, lines with high conflict between their blocks, only 1 functional line.]

FFT-Cache Architecture
Added components:
• Flexible Defect Map (FDM): keeps the faulty-location information; same number of lines as banks.
• MUXing layer: selects between different subblocks/blocks to create the final fault-free block.
[Figure: base architecture compared with the FFT architecture.]
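What the MUXing layer does can be illustrated in software. This is a sketch under stated assumptions rather than a description of the hardware: a host block and its target block are each given as a list of subblock payloads plus a fault mask (one bit per subblock), and the selection takes each subblock from the host unless the host's copy is faulty. The name `mux_select` is hypothetical.

```python
from typing import List


def mux_select(host_data: List[bytes], host_mask: int,
               target_data: List[bytes], target_mask: int) -> List[bytes]:
    """Assemble a fault-free block from a host block and its target block.

    Subblock i comes from the host unless bit i of host_mask is set, in
    which case it comes from the target. The two blocks must not conflict.
    """
    if len(host_data) != len(target_data):
        raise ValueError("blocks must have the same number of subblocks")
    if host_mask & target_mask:
        raise ValueError("conflict: both blocks are faulty at the same subblock")
    return [
        target_data[i] if (host_mask >> i) & 1 else host_data[i]
        for i in range(len(host_data))
    ]


# Host is faulty at subblocks 0 and 2; target is faulty at subblock 1.
host = [b"??", b"h1", b"??", b"h3"]
target = [b"t0", b"??", b"t2", b"t3"]
print(mux_select(host, 0b0101, target, 0b0010))
```

The conflict check mirrors the base-architecture definition: a target is only usable when no subblock position is faulty in both blocks.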

Evaluation Methodology
• Analytical model
  - Estimates the probability of failure of FFT-Cache.
• Experimental setup
  - Baseline processor: Nehalem-based, with a 64KB 4-way set-associative L1 cache and a 2MB 8-way L2.
  - Monte Carlo simulation using our FDM configuration algorithm.
  - Identify the Vdd-min and the portion of the cache that must be disabled while achieving a 99.9% yield.

Analytical Model of Cache Failure
[Figure not reproduced.]
FFT-Cache can reduce Vdd below 375 mV, compared with 465 mV and 520 mV for the DECTED and SECDED methods, respectively.

Experiment 1: Impact of FFT-Cache on Performance
• Results for the minimum-voltage configuration on L1 and L2 (Vdd = 375 mV, 16-bit subblock).
• The performance drop is due to:
  - increased cache access delay (from 2 to 3 cycles for L1 and from 20 to 22 cycles for L2);
  - reduced effective cache size (by less than 25%).
• Average performance drop: 2.2% for L1 and 1% for L2; less than 4% when applied to both L1 and L2.
• The extra cycle has a larger impact than the cache size reduction.
[Figure: IPC loss (%).]

Experiment 2: Area and Power Overheads
• FFT is implemented on L1 and L2 using the operating points given earlier.
• The power overhead is reported for high-power mode (nominal Vdd).
• 8T cells protect the tag and defect map arrays in low-power mode.
• The defect map is the major component of area overhead for both L1 and L2.
• The defect map is the major source of leakage power in both L1 and L2.
• The main source of dynamic power at nominal Vdd is the bypass MUXes.
• L2 overheads are smaller than L1 overheads.

Remapping for Multi-Bank Memory
• Impact of voltage-scaling-induced errors on the available cache capacity:
  - The available cache capacity increases with the number of banks, since there are more opportunities for remapping.
[Figure: baseline tiled CMP architecture.]

Remapping Policy
• Adjacent mapping: moderate latency, moderate capacity, moderate traffic.
• Global mapping: maximum latency, maximum capacity, maximum traffic.
[Figure: adjacent mapping and global mapping.]

Impact of Network Configuration
• Power and performance results for various network configurations.
• A high-performance network is needed as voltage scales down.

Conclusion
• We proposed FFT-Cache, a fault-tolerant cache architecture that achieves a significant reduction in power consumption through aggressive voltage scaling.
  - FFT-Cache uses a portion of the faulty cache blocks (global blocks) as redundancy to tolerate other faulty blocks or lines.
• FFT-Cache has a flexible defect map and an efficient configuration algorithm that categorizes the cache lines by the degree of conflict between their blocks.
• Using our approach:
  - The operating voltage of the memory can be reduced to 375 mV in 45 nm technology.
• A large CMP architecture needs a high-performance network to handle the heavy traffic induced by remapping.

Thank You!

Comparison with Recent Works
[Table: numeric entries are missing from this transcript. Columns: Scheme; Vdd-min (mV); L1 Cache: Area over. (%), Power over. (%); L2 Cache: Area over. (%), Power over. (%); Norm. IPC. Rows: 6T cell, ZerehCache, Wilkerson, Ansari, T cell, FFT-Cache.]
FFT-Cache achieves the lowest operating voltage (375 mV) and the lowest area and L1 power overhead.