Copyright © 2006 UCI ACES Laboratory Kyoungwoo Lee 1, Aviral Shrivastava 2, Ilya Issenin 1, Nikil Dutt 1, and Nalini Venkatasubramanian.

Presentation transcript:

Copyright © 2006 UCI ACES Laboratory
Mitigating Soft Error Failures for Multimedia Applications by Selective Data Protection
Kyoungwoo Lee (1), Aviral Shrivastava (2), Ilya Issenin (1), Nikil Dutt (1), and Nalini Venkatasubramanian (3)
(1) ACES Lab. and (3) DSM Lab., University of California at Irvine
(2) Compiler and Microarchitecture Lab., Arizona State University

Copyright © 2006 UCI ACES Laboratory, CASES’06
Soft Errors: Major Concern for Reliability
- Soft errors cause failures
  - Transient faults in electronic devices
  - A program can crash, give wrong output, go into an infinite loop, etc.
- Causes of soft errors
  - Poor system design
  - Random noise or signal-integrity problems such as crosstalk
  - Radiation-induced
    - Alpha particles, neutrons, protons, etc.
    - Dominant contributor to soft errors
    - Radiation cannot be completely shielded (e.g., a neutron can pass through 5 feet of concrete)
Radiation-induced soft errors are dominant

The Phenomenon of Radiation-Induced Soft Errors
[Figure: radiation striking a transistor (source, drain); labels: Bit Value (0, 1), Bit Flip]

Impact of Soft Errors
- Soft Error Rate (SER)
  - FIT (Failures In Time): number of failures in one billion hours
  - Mean Time To Failure (MTTF)
- Examples
  - Cellphone with 4 Mbit of low-power SRAM at 1,000 FIT per Mbit: MTTF = 28 years
  - Laptop PC with 256 MB of DRAM at 600 FIT per Mbit: MTTF = 1 month
  - Router farm with 100 Gbit of SRAM at 600 FIT per Mbit: MTTF = 17 hours
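
The MTTF figures on this slide can be reproduced from the FIT definition. The sketch below is only an illustration: it assumes that per-Mbit FIT rates add up linearly over the memory size, that 1 MB = 8 Mbit, and that 1 Gbit = 1,000 Mbit. The function name `mttf_hours` is hypothetical and does not come from the slides.

```python
HOURS_PER_YEAR = 24 * 365


def mttf_hours(capacity_mbit: float, fit_per_mbit: float) -> float:
    """Return the MTTF in hours of a memory, given its size and per-Mbit FIT rate.

    FIT counts failures per one billion (1e9) device hours, so the total FIT
    of the memory is assumed to be capacity * per-Mbit FIT.
    """
    total_fit = capacity_mbit * fit_per_mbit
    return 1e9 / total_fit


# The three examples from the slide (unit conversions are assumptions).
cellphone = mttf_hours(4, 1_000)          # 4 Mbit SRAM
laptop = mttf_hours(256 * 8, 600)         # 256 MB DRAM = 2,048 Mbit
router = mttf_hours(100 * 1_000, 600)     # 100 Gbit SRAM

print(f"cellphone: {cellphone / HOURS_PER_YEAR:.1f} years")  # about 28.5 years
print(f"laptop:    {laptop / 24:.1f} days")                  # about 34 days
print(f"router:    {router:.1f} hours")                      # about 16.7 hours
```

Under these assumptions the results round to the 28 years, 1 month, and 17 hours quoted on the slide.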

Soft Errors on the Increase
- SER increases exponentially with technology scaling
  - 0.18 µm: 1,000 FIT per Mbit of SRAM
  - 0.13 µm: 10,000 to 100,000 FIT per Mbit of SRAM
- Voltage scaling
  - Voltage scaling increases SER significantly
$SER \propto N_{flux} \times CS \times \exp\left(-\frac{Q_{critical}}{Q_S}\right)$, where $Q_{critical} = C \times V$
Soft error is a main design concern!
[Hazucha et al., IEEE] P. Hazucha and C. Svensson. Impact of CMOS Technology Scaling on the Atmospheric Neutron Soft Error Rate. IEEE Trans. on Nuclear Science, 47(6):2586–2594.

Soft Errors in Caches are Important
- Soft errors in memory are much more important than in combinational logic
  - Strong temporal masking in combinational logic
  - Most upsets in memory manifest as soft errors
  - Only 11 % of soft errors occur in combinational logic
- Redundancy techniques are popular for memories
  - ECC-based solutions
  - Not applicable for caches
    - Caches are very sensitive to performance and power overheads
- Caches are most vulnerable to soft errors
  - Caches occupy the majority of the processor area (can be more than 50 %)
  - e.g., Intel Itanium II (0.18 µm): more than 50 % of the area
Need to minimize failures due to soft errors in caches

ECC Protection
- ECC (Error Correcting Codes) is a popular technique to protect memory from soft errors
- But it has high overheads in terms of area, performance, and power
  - e.g., SEC-DED Hamming Code (32, 6)
    - Performance overhead of up to 95 % [Li et al., MTDT ’05]
    - Energy overhead of up to 22 % [Phelan, ARM ’03]
    - Area overhead of more than 18 % [Phelan, ARM ’03]
[Figure: unprotected cache (data only) versus protected cache (data plus ECC, with coding and decoding logic)]
ECC protection for caches is expensive!
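
To make the coding and decoding steps concrete, here is a minimal sketch of a single-error-correcting Hamming code. It deliberately uses the small textbook Hamming(7,4) code rather than the SEC-DED code with 32 data bits named on the slide, and it has no double-error detection. The function names are illustrative only.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword laid out as p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one flipped bit; return (data bits, 1-based error position or 0)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s4   # the syndrome spells out the flipped position
    if error_pos:
        c[error_pos - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], error_pos


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1                         # emulate a soft error in one stored bit
recovered, position = hamming74_decode(stored)
print(recovered == word, position)
```

Even in this toy code, every write pays for the parity computation and every read pays for the syndrome check, which is the kind of per-access cost the slide attributes to ECC-protected caches.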

Problem Statement
- Dual optimization
  - Reduce failures due to soft errors in caches
  - Minimize power and performance overheads

Outline
- Motivation and Problem Statement
- Related Work
- Our Solution
- Experiments
- Conclusion

Related Work in Combating Soft Errors
- Process technology solutions
  - Hardening: [Baze et al., IEEE Trans. on Nuclear Science ’00]
  - SOI: [O. Musseau, IEEE Trans. on Nuclear Science ’96]
  - Drawbacks: process complexity, yield loss, and substrate cost
- Microarchitectural solutions for caches
  - Cache scrubbing: [Mukherjee et al., PRDC ’04]
  - Low-power cache: [Li et al., ISLPED ’04]
  - Area-efficient protection: [Kim et al., DATE ’06]
  - Multiple-bit correction: [Neuberger et al., TODAES ’03]
  - Cache size selection: [Cai et al., ASP-DAC ’06]
  - Drawbacks: high overheads in terms of power, performance, and area
- Our solution
  - Compiler-based microarchitectural technique
  - Provides protection from soft errors while minimizing the power, performance, and area overheads

Outline
- Motivation and Problem Statement
- Related Work
- Our Solution
  - Observation
  - Software Support
  - Architectural Support
- Experiments
- Conclusion

Observation
- Memory is divided into pages
- Suppose you could protect pages from soft errors independently
[Figure: application data memory of N KB viewed as N x 1 KB pages (1, 2, ..., K, ..., N); 1000 simulations with random error injection; number of failures]

Observation
- For a multimedia application: susan
- Failure: the application crashes, goes into an infinite loop, writes a broken image-file header, produces a wrong image size, etc.
- Loss in Quality of Service is not a failure
Not all pages are important!

Data Partitioning
- Failure Critical (FC) data
  - Loop bounds, loop iterators, branch decision variables, etc.
  - An error may result in a failure
- Failure Non-Critical (FNC) data
  - Multimedia data (e.g., image pixel bits)
  - An error may not cause failures
  - Only loss in QoS
- Sample code (FNC, FC):
    …
    if ( condition ) {
      for ( loop = 1; loop < 64; loop++ ) {
        local = MM[loop] / ( 2*constant );
        MM[loop] = min( 127, max( -127, MM[loop] ) );
      }
    …
- Our approach for multimedia applications
  - Simple data partitioning
    - All multimedia data is FNC
    - Everything else is FC
  - The user marks the FNC (multimedia) data
    - Very simple to do

Composition of FC and FNC data
[Figure: composition of FC and FNC pages; one data label reads 54 %]
On average, 50 % of pages are FNC
Should be able to reduce ECC overheads by half

HPC (Horizontally Partitioned Caches)
- HPC
  - More than one cache at the same level of the hierarchy
  - Each page in memory is mapped to exactly one cache
- Originally proposed to separate stack data and array data
  - Performance improvements
- Also very effective in reducing energy consumption [Shrivastava et al., CASES’05]
- The mini cache is typically smaller than the main cache
[Figure: processor (e.g., Intel XScale) with processor pipeline, main cache, and mini cache (HPC); page mapping; memory controller; memory]

PPC (Partially Protected Caches)
- We propose Partially Protected Caches
  - Main cache: unprotected
  - Mini cache: protected from soft errors
- The compiler maps data to the two caches
  - Map FNC data to the unprotected main cache
  - Map FC data to the protected mini cache
- The intuition is to provide protection to only the FC data
[Figure: memory holding FNC and FC pages; page mapping sends FNC pages to the unprotected main cache and FC pages to the protected mini cache; processor pipeline, HPC processor, memory controller]
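
The page mapping described on this slide can be sketched as follows. This is a hypothetical illustration, not the authors' compiler pass: the 1 KB page size is taken from the Observation slide, the region list stands in for the user's FNC annotations, and the rule that a page holding any FC data goes to the protected cache is a conservative assumption added here.

```python
from enum import Enum


class Criticality(Enum):
    FC = "failure critical"
    FNC = "failure non-critical"


class Cache(Enum):
    PROTECTED_MINI = "protected mini cache"
    UNPROTECTED_MAIN = "unprotected main cache"


PAGE_SIZE = 1024  # bytes; assumed 1 KB pages


def map_pages(regions):
    """Map each touched page to exactly one cache.

    regions: iterable of (start_address, size_in_bytes, Criticality).
    A page that contains any FC data is kept in the protected mini cache
    (assumption: err on the side of protection when a page is shared).
    """
    mapping = {}
    for start, size, criticality in regions:
        first_page = start // PAGE_SIZE
        last_page = (start + size - 1) // PAGE_SIZE
        for page in range(first_page, last_page + 1):
            already_protected = mapping.get(page) is Cache.PROTECTED_MINI
            if criticality is Criticality.FC or already_protected:
                mapping[page] = Cache.PROTECTED_MINI
            else:
                mapping[page] = Cache.UNPROTECTED_MAIN
    return mapping


# Hypothetical layout: a 2 KB image buffer (FNC), then loop state (FC),
# then a small FNC buffer that shares a page with the FC data.
layout = [
    (0, 2048, Criticality.FNC),
    (2048, 100, Criticality.FC),
    (2148, 100, Criticality.FNC),
]
page_map = map_pages(layout)
for page, cache in sorted(page_map.items()):
    print(page, cache.value)
```

Mapping at page granularity matches the HPC constraint that each page in memory is mapped to exactly one cache.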

Outline
- Motivation
- Related Work
- Partially Protected Caches and Selective Data Protection
- Experiments
  - Experimental Framework
  - Results
- Conclusion

Data Cache Configurations
- Configuration 1: Unsafe cache configuration (traditional)
  - Unprotected cache
  - High failures, high performance, low energy
- Configuration 2: Safe cache configuration (traditional)
  - Protected cache with ECC coding and decoding
  - Low failures, low performance, high energy
- Configuration 3: PPC cache configuration (proposed)
  - Unprotected cache plus ECC-protected cache
  - Low failures, high performance, low energy

Experimental Framework
[Figure: experimental flow]
- Applications (MiBench, MediaBench): image (SUSAN), audio (ADPCM, G.721), video (H.263)
- Compiler (gcc), informed of the multimedia data, produces the executable and the page mapping
- Configurations: UNSAFE (no protection for FNC and FC), SAFE (protection for FNC and FC), PPC (selective protection)
- Cache simulator (SimpleScalar) with accelerated soft error injection; CACTI and synthesis (Synopsys) for the Hamming code
- Report: failure rate, runtime, energy

Experimental Results 1
- Effectiveness of our approach: Selective Data Protection using the PPC architecture
- Data cache similar to Intel XScale
  - Unsafe: 32 KB (no protection) data cache
  - Safe: 32 KB (protection) data cache
  - PPC: 32 KB (no protection) and 2 KB (protection) data caches
- Data cache configuration
  - 32-byte line size, 4-way set-associative, FIFO replacement
- Soft error injection
  - Randomly inject soft errors every cycle if the data in the cache is valid
  - Accelerated Soft Error Rate (SER)
    - Base SER = 1e-9 per cycle per 1 KB of data cache
  - Multiple-Bit Errors (MBE) and Single-Bit Errors (SBE)
    - SER for MBE is 100 times less than SER for SBE
- Metrics
  - Reliability in terms of failure rates
    - Number of failures in 1,000 runs
  - Performance
    - System performance: number of processor cycles + data cache accesses + main memory accesses
  - Energy consumption
    - System energy: processor energy + data cache energy (protected and unprotected) + main memory bus energy + main memory access energy
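
A per-cycle injection loop in the spirit of this slide might look like the sketch below. It is an assumption-laden illustration, not the authors' SimpleScalar extension: the cache is a flat byte array, a multiple-bit error is modeled as exactly two distinct flipped bits in one line, and all names are hypothetical. The example run uses a much higher rate than the slide's 1e-9 so that a short run produces events.

```python
import random

LINE_BYTES = 32  # line size from the slide


def flip_bits(cache, line, count, rng):
    """Flip `count` distinct, randomly chosen bits inside one cache line."""
    for bit in rng.sample(range(LINE_BYTES * 8), count):
        cache[line * LINE_BYTES + bit // 8] ^= 1 << (bit % 8)


def inject(cache, valid_lines, cycles, sbe_rate_per_kb, rng):
    """Inject random soft errors into valid lines; return (single-bit, multi-bit) counts.

    sbe_rate_per_kb is the single-bit error probability per cycle per 1 KB of
    valid data; the multiple-bit rate is taken as 100 times smaller.
    """
    sbe = mbe = 0
    if not valid_lines:
        return sbe, mbe
    valid_kb = len(valid_lines) * LINE_BYTES / 1024
    p_sbe = sbe_rate_per_kb * valid_kb
    p_mbe = p_sbe / 100
    for _ in range(cycles):
        r = rng.random()
        if r < p_mbe:
            flip_bits(cache, rng.choice(valid_lines), 2, rng)
            mbe += 1
        elif r < p_mbe + p_sbe:
            flip_bits(cache, rng.choice(valid_lines), 1, rng)
            sbe += 1
    return sbe, mbe


# Hypothetical run: a 2 KB cache with every line valid and a heavily
# accelerated rate, seeded for repeatability.
rng = random.Random(0)
cache = bytearray(64 * LINE_BYTES)
sbe_count, mbe_count = inject(cache, list(range(64)), 1000, 0.25, rng)
print(sbe_count, mbe_count)
```

A failure-rate experiment would wrap a loop like this around many application runs and count how many end in a crash, hang, or unusable output.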

Failure Rate
- Normalized failure rate: ratio of the failure rate for each configuration to that of the Unsafe configuration
Failure rate of PPC is close to that of Safe

Performance
- Normalized runtime: ratio of the runtime for each configuration to that of the Unsafe configuration
PPC has performance close to Unsafe
On average, PPC has a 32 % runtime reduction compared to Safe
PPC has only 1 % performance overhead compared to Unsafe
Note: our paper in CASES ’06 has more conservative numbers due to a mistake in the performance calculations for a couple of benchmarks.

Energy Consumption
- Normalized energy consumption: ratio of the energy consumption for each configuration to that of the Unsafe configuration
PPC has energy consumption close to Unsafe
On average, PPC has a 29 % energy reduction compared to Safe
PPC has 10 % energy consumption overhead compared to Unsafe

Experimental Results 2
- Design space exploration
  - Various cache configurations
    - Impact of cache size: 512 bytes to 32 KB in powers of 2
    - Set associativity: direct-mapped, 4-way, 32-way
- Metrics
  - Reliability in terms of failure rates
  - Performance
  - Energy consumption

Results 2: Design Space Exploration
- Failure rate of PPC is close to that of Safe
- Performance and energy consumption of PPC are close to those of Unsafe
PPC can hold failure rate, performance, and power between Safe and Unsafe

Conclusion
- Soft errors are a major design concern for system reliability
- We propose Partially Protected Caches and Selective Data Protection for multimedia applications
- Our approach compared to the Safe configuration
  - Comparable failure rates
  - 32 % performance improvement
  - 29 % energy saving
- Our approach works across cache configurations
- Future work
  - Selective Data Protection for general applications
  - Selective Data Protection in other components such as logic

Thanks! Any Questions?

Backup Slides

Focus
- Who is your audience? CASES ’06: compiler people
  - High-level presentation, more focus on compiler approaches
- What is the strong motivation? Soft errors and their high-overhead protection
  - Specific motivation
- What is our problem? Soft error protection in caches for multimedia applications
- What is our contribution? Selective Data Protection without losing reliability
  - Key idea
  - Clear experimental framework
- Every slide should have at least a tiny picture that visualizes and supports the contents
  - Unified $ and HPC $

Outline
- Soft Errors in Cache
  - ECC protection: high overheads in terms of performance and power
- Selective Data Protection in HPC
  - Reduce overheads with comparable reliability
  - Multimedia applications
- Experiments
  - Experimental Framework
  - Results
- Conclusion

Strong Motivation
- Soft errors are critical
- ECC is expensive
- Not all data are equally critical to failures
- Multimedia is a good example

Soft Error
R. Mastipuram and E. C. Wee. Soft Errors’ Impact on System Reliability. EDN online, Sep 2004.

Soft Errors vs. Hard Errors
- Soft errors are random, radiation-induced Single Event Effects (SEE)
- Transient faults vs. permanent faults
- The probability of soft errors is up to 100x higher than that of hard errors

SER formula
$SER \propto N_{flux} \times CS \times \exp\left(-\frac{Q_{critical}}{Q_S}\right)$
- $N_{flux}$: intensity of the neutron flux
- $CS$: area of the cross section of the node
- $Q_S$: charge collection efficiency
- $Q_{critical}$: minimum charge required for a cell to retain data
  - $Q_{critical} = C \times V$, where $C$ is the capacitance and $V$ is the supply voltage
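
The exponential dependence on $Q_{critical}$ can be seen by comparing two design points that share the same neutron flux and cross section. In the sketch below, the function name and the starting ratio $Q_{critical}/Q_S = 5$ are hypothetical values chosen only for illustration; they do not come from the slides.

```python
import math


def relative_ser(c_scale: float, v_scale: float, q_ratio: float) -> float:
    """Return SER_new / SER_old when C and V are scaled by the given factors.

    q_ratio is the original Q_critical / Q_S. With N_flux and CS unchanged,
    the ratio of the two SER values is
    exp(-(q_ratio * c_scale * v_scale)) / exp(-q_ratio).
    """
    new_ratio = q_ratio * c_scale * v_scale   # Q_critical = C x V
    return math.exp(q_ratio - new_ratio)


# Hypothetical example: capacitance unchanged, supply voltage lowered by 20 %.
increase = relative_ser(c_scale=1.0, v_scale=0.8, q_ratio=5.0)
print(f"SER grows by a factor of about {increase:.2f}")
```

With these illustrative numbers a 20 % voltage reduction raises the SER by a factor of e, and lowering the capacitance as well compounds the effect.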

Soft Error is Critical
- High integration
  - High integration potentially raises soft errors [Mastipuram et al., EDN ’04]
  - (e.g.) Cellphone with 4 Mbit of low-power SRAM at 1,000 FIT per Mbit: 28 years MTTF
  - (e.g.) Laptop PC with 256 MB of DRAM at 600 FIT per Mbit: one month MTTF
  - (e.g.) Router farm with 100 Gbit of SRAM at 600 FIT per Mbit: 17 hours MTTF
[Mastipuram et al., EDN ’04] R. Mastipuram and E. C. Wee. Soft Errors’ Impact on System Reliability. EDN online, Sep 2004.

Soft Errors Increase with Technology Advances
- Soft errors are affected by [Hazucha et al., IEEE]:
  - Process technology
    - Shrinking increases SER exponentially
    - (e.g.) 1,000 FIT per Mbit of SRAM in 0.18 µm; 10,000 to 100,000 FIT per Mbit of SRAM in 0.13 µm [Mastipuram et al., EDN ’04]
  - Voltage scaling
    - Voltage scaling increases SER significantly
  - Radiation intensity
    - Latitude and altitude
    - (e.g.) 10 to 100 times higher SER in flight than on the ground
[Figure: 0.18 µm transistor and 0.13 µm transistor (source, drain); C and V decrease]
$SER \propto N_{flux} \times CS \times \exp\left(-\frac{Q_{critical}}{Q_S}\right)$, where $Q_{critical} = C \times V$
Soft error is a main design concern!
[Hazucha et al., IEEE] P. Hazucha and C. Svensson. Impact of CMOS Technology Scaling on the Atmospheric Neutron Soft Error Rate. IEEE Trans. on Nuclear Science, 47(6):2586–2594.

Soft Error is Critical
- High integration
  - Raises SE linearly
- Process technology
  - Shrinking decreases $Q_{critical}$ and increases SER exponentially
  - (e.g.) 1,000 FIT per Mbit of SRAM in 0.18 µm; 10,000 to 100,000 FIT per Mbit of SRAM in 0.13 µm [Mastipuram et al., EDN ’04]
[Figure: 0.18 µm transistor and 0.13 µm transistor (source, drain); C and V decrease]
$SER \propto N_{flux} \times CS \times \exp\left(-\frac{Q_{critical}}{Q_S}\right)$, where $Q_{critical} = C \times V$

Soft Error is Critical
- High integration
  - Raises SE linearly
- Process technology
  - Shrinking decreases $Q_{critical}$ and increases SER exponentially
- Voltage scaling
  - Voltage scaling decreases $Q_{critical}$ and increases SER exponentially
- Latitude and altitude
  - 10 to 100 times higher SER in flight than on the ground
  - (e.g.) Potentially, a laptop PC with 256 MB of memory on an airplane at 35,000 ft: 5 hours MTTF [Mastipuram et al., EDN ’04]
$SER \propto N_{flux} \times CS \times \exp\left(-\frac{Q_{critical}}{Q_S}\right)$, where $Q_{critical} = C \times V$
[Figure: SER versus $N_{flux}$; annotations: 1 month MTTF, 5 hours MTTF]
Soft error is a main design concern!
R. Mastipuram and E. C. Wee. Soft Errors’ Impact on System Reliability. EDN online, Sep 2004.

Soft Errors in Caches are Important
- Core: combinational logic
  - Robust structure
  - Masking (e.g., logical, electrical, and temporal masking)
  - Only 10 % of soft errors occur in combinational logic
- Main memory: DRAM
  - An upset in memory is not masked
  - SER is not increasing with technology generations
- Cache: SRAM
  - An upset is not masked
  - SER is increasing significantly with technology generations
  - Occupies most of the processor area, e.g., Intel Itanium II (0.18 µm): more than 50 % of the area
  - The cache affects performance and power consumption significantly
[Figure: DRAM SER and SRAM SER]
Robert Baumann. Soft Errors in Advanced Computer Systems. IEEE Design and Test of Computers, 2005.
S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim. Robust System Design with Built-In Soft-Error Resilience. IEEE Computer, 2005.
Richard Loft. Supercomputing Challenges at the National Center for Atmospheric Research.

Most Effective Protection: ECC
- ECC (Error Correcting Codes): information redundancy
  - Code data and store extra control data
  - Decode data and detect/correct errors in data
- High overheads in terms of area, performance, and power
  - (e.g.) SEC-DED (Single Error Correction and Double Error Detection) for cache (or SRAM): Hamming Codes (32, 6)
    - Performance overhead of up to 95 %
    - Energy overhead of up to 22 %
    - Area overhead of more than 18 %
[Figure: unprotected cache (data) versus protected cache (data plus control, with coding and decoding)]
ECC protection for every cache access is too expensive!
J.-F. Li and Y.-J. Huang. An Error Detection and Correction Scheme for RAMs with Partial-Write Function. In MTDT’05, pages 115–120.
R. Phelan. Addressing Soft Errors in ARM Core-based Designs. Technical report, ARM.

ECC Protection for Caches is Expensive
- ECC (Error Correcting Codes) is the most effective technique to protect memory from soft errors
- ECC has high overheads in terms of area, performance, and power
  - (e.g.) SEC-DED: Hamming Codes (32, 6)
    - Performance overhead of up to 95 % [Li et al., MTDT ’05]
    - Energy overhead of up to 22 % [Phelan, ARM ’03]
    - Area overhead of more than 18 % [Phelan, ARM ’03]
[Figure: unprotected cache (data) versus protected cache (data plus ECC, with coding and decoding)]
ECC protection for every cache access is expensive!
[Li et al., MTDT ’05] J.-F. Li and Y.-J. Huang. An Error Detection and Correction Scheme for RAMs with Partial-Write Function. In MTDT’05, pages 115–120.
[Phelan, ARM ’03] R. Phelan. Addressing Soft Errors in ARM Core-based Designs. Technical report, ARM.

Power PC 4

Pentium 4

Intel Duo

Cache Miss Rates of FC and FNC data

Benchmarks
- MiBench
  - Image processing: Susan Edges, Susan Corners, Susan Smoothing
  - Audio codec: ADPCM Encoder/Decoder
- MediaBench
  - Audio codec: G.721 Encoder/Decoder
- PeaCE (Ptolemy extension as Codesign Environment)
  - H.263 Video Encoder

Failures
- Cannot open the output of multimedia processing
  - No output
  - Incorrect output name
  - Wrong header
- Different output size
- Crash
- Infinite loop

Performance
- Unsafe: Num_Inst + Access*2 + Miss*25
- Safe: Num_Inst + Access*3 + Miss*25
- HPC: Num_Inst + 2*(Main_Access + Mini_Access) + 25*(Main_Miss + Mini_Miss)
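
The three runtime estimates on this slide translate directly into code. The sketch below copies the coefficients from the slide; treating the results as cycle counts and the sample counts in the usage lines are assumptions made only for illustration.

```python
def runtime_unsafe(num_inst: int, accesses: int, misses: int) -> int:
    """Unsafe configuration: Num_Inst + Access*2 + Miss*25."""
    return num_inst + 2 * accesses + 25 * misses


def runtime_safe(num_inst: int, accesses: int, misses: int) -> int:
    """Safe configuration: Num_Inst + Access*3 + Miss*25."""
    return num_inst + 3 * accesses + 25 * misses


def runtime_hpc(num_inst: int, main_acc: int, mini_acc: int,
                main_miss: int, mini_miss: int) -> int:
    """HPC configuration, with the coefficients exactly as given on the slide."""
    return num_inst + 2 * (main_acc + mini_acc) + 25 * (main_miss + mini_miss)


# Hypothetical counts: 1,000,000 instructions, 300,000 data cache accesses,
# 3,000 misses.
unsafe = runtime_unsafe(1_000_000, 300_000, 3_000)
safe = runtime_safe(1_000_000, 300_000, 3_000)
print(unsafe, safe, round(safe / unsafe, 3))
```

In this model the only difference between Unsafe and Safe is the per-access coefficient (2 versus 3), so the runtime gap grows with the number of cache accesses.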

Energy Consumption
- Energy consumption of the whole system
  - Processor pipeline: 0.67 nJoules per cycle
  - Cache: nJoules from CACTI
  - Memory: nJoules per access
  - Off-chip bus
- Tools: CACTI and Synopsys Design Compiler
$E = \{(A_{SEprone} \times E_{SEprone}) + A_{SEprotected} \times (E_{SEprotected} + E_{dec}) + (W_{SEprotected} \times E_{cod})\} + \{(M_{SEprone} + M_{SEprotected}) \times (E_{bus} + E_{mem})\} + \{(M_{SEprotected} \times E_{cod}) + (R_{SEprotected} \times E_{dec})\} + \{E_{proc} \times (A_{SEprotected} + A_{SEprone})\}$
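
The energy expression can be transcribed term by term, as in the sketch below. The slide does not define its symbols, so the readings used in the comments (A as access counts, W as write counts, M as miss counts, E as per-event energies) are guesses, R is left uninterpreted, and the numbers in the usage lines are placeholders rather than measured values.

```python
def system_energy(*, a_prone, a_prot, w_prot, r_prot, m_prone, m_prot,
                  e_prone, e_prot, e_dec, e_cod, e_bus, e_mem, e_proc):
    """Evaluate the slide's energy formula, grouped by its four braces.

    Parameter names mirror the slide's subscripts ("prone" = SEprone,
    "prot" = SEprotected); their meanings are assumptions, not definitions
    taken from the source.
    """
    cache = (a_prone * e_prone) + a_prot * (e_prot + e_dec) + (w_prot * e_cod)
    memory = (m_prone + m_prot) * (e_bus + e_mem)
    ecc_extra = (m_prot * e_cod) + (r_prot * e_dec)
    processor = e_proc * (a_prot + a_prone)
    return cache + memory + ecc_extra + processor


# Placeholder values chosen so the arithmetic is easy to follow by hand.
total = system_energy(
    a_prone=10, a_prot=5, w_prot=2, r_prot=1, m_prone=3, m_prot=1,
    e_prone=1.0, e_prot=1.0, e_dec=1.0, e_cod=1.0,
    e_bus=1.0, e_mem=1.0, e_proc=1.0,
)
print(total)  # 22 (cache) + 8 (memory) + 2 (ECC extra) + 15 (processor) = 47.0
```

Keeping the four braces as separate sub-totals makes it easy to see which part of the system dominates for a given configuration.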

Clear Problem Definition and Intensive Experiments
- The problem should be clear and very specific
  - Selective Data Protection in HPC for multimedia applications
- Our strength is the experimental framework and extensive experiments
  - Detailed presentation of our simulation environments and benchmarks
  - Experimental sets
    - Effects of our approach in terms of power, performance, and reliability
    - Design space exploration

Problem Definition
- Configurations
  - Unsafe
  - Safe
  - HPC
- Our interest lies in mitigating failures due to soft errors, rather than in decreasing soft errors

Failure Rate
- Normalized failure rate: ratio of the failure rate for each configuration to that of the Unsafe configuration
PPC provides reliability comparable to Safe
On average, both have 45 times fewer failures than Unsafe

Performance
- Normalized runtime: ratio of the runtime for each configuration to that of the Unsafe configuration
PPC removes the performance overhead of Safe
On average, PPC has a 32 % runtime reduction compared to Safe
PPC has only 1 % performance overhead compared to Unsafe
Note: our paper in CASES ’06 has more conservative numbers due to a mistake in the performance calculations for a couple of benchmarks.

Energy Consumption
- Normalized energy consumption: ratio of the energy consumption for each configuration to that of the Unsafe configuration
PPC has less energy overhead than Safe
On average, PPC has a 29 % energy reduction compared to Safe
PPC has 10 % energy consumption overhead compared to Unsafe

Runtime

Failure Rate

QoS
- $PSNR = 10 \log_{10}\left(\frac{MAX^2}{MSE}\right)$
- MSE: Mean Squared Error
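
As a quick illustration of this QoS metric, the sketch below computes PSNR for two equally sized pixel sequences. It assumes 8-bit samples (MAX = 255) by default and returns infinity when the two inputs are identical; both choices, and the function name, are conventions adopted here rather than details from the slides.

```python
import math


def psnr(reference, degraded, max_value=255):
    """Return PSNR in dB: 10 * log10(MAX^2 / MSE)."""
    if len(reference) != len(degraded) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return math.inf   # identical images: no measurable degradation
    return 10 * math.log10(max_value ** 2 / mse)


# Hypothetical pixel values: each sample is off by one grey level.
quality = psnr([10, 20, 30, 40], [11, 19, 31, 39])
print(f"{quality:.2f} dB")
```

A bit flip in FNC pixel data lowers this number without preventing the application from producing its output, which is why the slides count it as QoS loss rather than failure.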

Results 2: Design Space Exploration
- Failure rate
Failure rates increase and then saturate
Failure rate of PPC is close to that of Safe

Results 2: Design Space Exploration
- Performance
Performance of PPC is close to that of Unsafe (32 % reduction compared to Safe)

Results 2: Design Space Exploration
- Energy consumption
Miss rate reduction and high access power cost
Energy consumption of PPC lies between Safe and Unsafe (24 % reduction compared to Safe)

QoS

Area

Composite Metric
- LOG(Failure_Rate) * Performance * Energy