Approximate Computing: (Old) Hype or New Frontier? Sarita Adve, University of Illinois and EPFL. Acks: Vikram Adve, Siva Hari, Man-Lap Li, Abdulrahman Mahmoud, Helia Naeimi, Pradeep Ramachandran, Swarup Sahoo, Radha Venkatagiri, Yuanyuan Zhou

What is Approximate Computing?
Trading output quality for resource savings? Exploiting applications' inherent ability to tolerate errors?
Old? GRACE [ ] (rsim.cs.illinois.edu/grace), SWAT [2006- ] (rsim.cs.illinois.edu/swat), and many others
Or something new?
This talk: approximate computing through the lens of hardware resiliency …

Motivation
[Figure: reliability vs. redundancy trade-off; more redundancy buys reliability but raises overhead (perf., power, area)]
Goal: high reliability at low cost

SWAT: A Low-Cost Reliability Solution
Need to handle only hardware faults that affect software ⇒ watch for software anomalies (symptoms)
– Zero to low overhead "always-on" monitors
Diagnose the cause after an anomaly is detected, then recover
– May incur high overhead, but invoked infrequently
SWAT: SoftWare Anomaly Treatment

SWAT Framework Components
Detection: monitor symptoms of software misbehavior
Diagnosis: rollback/replay on a multicore
Recovery: checkpoint/rollback, or app-specific action on acceptable errors
Repair/reconfiguration: redundant, reconfigurable hardware
Flexible control through firmware
[Diagram: Fault → Error → Anomaly detected, followed by diagnosis, repair, and checkpoint-based recovery]

SWAT Framework Components (cont.) – open questions:
– How to bound silent (escaped) errors?
– When is an error acceptable (at some utility) or unacceptable?
– How to trade silent errors against resources?
– How to associate recovery actions with errors?

Advantages of SWAT
Handles only errors that matter
– Oblivious to masked errors, low-level failure modes, and software-tolerated errors
Low, amortized overheads
– Optimize for the common case; exploit software reliability solutions
Customizable and flexible
– Firmware control can adapt to specific reliability needs
Holistic systems view enables novel solutions
– Software-centric, synergistic detection, diagnosis, and recovery solutions
Beyond hardware reliability
– Long-term goal: unified system (HW+SW) reliability
– Systematic trade-off between resource usage, quality, and reliability

SWAT Contributions (for in-core HW faults)
Fault detection: very low cost detectors [ASPLOS'08, DSN'08]; bound silent errors [ASPLOS'12, DSN'12]; identify error outcomes with Relyzer and GangES [ISCA'14]
Diagnosis: in-situ, trace-based µarch diagnosis [DSN'08]; mSWAT multicore detection & diagnosis [MICRO'09]
Fault recovery: handling I/O [TACO'15]
Error modeling: SWAT-Sim [HPCA'09]; FPGA-based [DATE'12]
A complete solution for in-core faults, evaluated on a variety of workloads

SWAT Fault Detection
Simple monitors observe anomalous SW behavior [ASPLOS'08, MICRO'09] at very low hardware area and performance overhead:
– Fatal Traps: division by zero, RED state, etc.
– Kernel Panic: OS enters panic state due to the fault
– High OS: high contiguous OS activity
– Hangs: simple HW hang detector
– App Abort: application aborts due to the fault
– Out of Bounds: flag illegal addresses
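As a minimal sketch, the out-of-bounds monitor's check might look like the C below; the address range, its values, and the function names are hypothetical illustrations, and the real monitor is a hardware/firmware mechanism rather than application code.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical legal data range for the app, programmed by firmware
     * at application load time. */
    typedef struct { uintptr_t lo, hi; } addr_range_t;
    static addr_range_t app_range = { 0x00100000u, 0x7fffffffu };

    /* Flag any data access outside the registered range as a symptom. */
    bool out_of_bounds(uintptr_t addr) {
        return addr < app_range.lo || addr >= app_range.hi;
    }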

Evaluating SWAT Detectors So Far
Full-system simulation with an out-of-order processor
– Simics functional simulation + GEMS timing simulator
– Single core, multicore, and distributed client-server configurations
Apps: multimedia, I/O-intensive, and compute-intensive
– Errors injected at different points in app execution
µarch-level error injections (single-error model)
– Stuck-at and transient errors in latches of 8 µarch units
– ~48,000 total errors

Error Outcomes
[Figure: three runs of an application with a transient error, each ending in a different outcome]
A transient error has one of three outcomes:
– Masked: the output is unaffected
– Detection: SWAT symptom detectors (fatal traps, kernel panic, etc.) fire
– Silent Data Corruption (SDC): the output is corrupted with no symptom; SDCs may be acceptable (possibly with some quality loss) or unacceptable
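As a concrete illustration, here is a hedged sketch of how a fault-injection campaign might classify one run's outcome under this taxonomy; the function and parameter names are illustrative, not from the SWAT tooling.

    #include <stdbool.h>

    typedef enum { MASKED, DETECTED, SDC_ACCEPTABLE, SDC_UNACCEPTABLE } outcome_t;

    /* Classify one injection run given whether a symptom detector fired,
     * whether the output matches the golden run, and the output quality. */
    outcome_t classify(bool symptom_fired, bool output_matches_golden,
                       double quality, double quality_threshold) {
        if (symptom_fired)         return DETECTED;
        if (output_matches_golden) return MASKED;
        return (quality >= quality_threshold) ? SDC_ACCEPTABLE
                                              : SDC_UNACCEPTABLE;
    }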

SWAT SDC Rates
SWAT detectors are effective: <0.2% of injected µarch errors produce unacceptable SDCs (109 of ~48,000 total)
BUT it is hard to sell a system that is only empirically validated and rarely incorrect

Challenges
[Figure: the reliability vs. overhead curve again; SWAT delivers very high reliability at low cost, but how do we make reliability tunable?]
Goals:
– Full reliability at low cost
– Accurate reliability evaluation
– Tunable reliability vs. quality vs. overhead

Research Strategy
Towards an application resiliency profile
– For a given instruction, what is the outcome of an error?
– For now, focus on a transient error in a single bit of a register
Convert SDCs into (low-cost) detections for full reliability and quality, OR let some acceptable or unacceptable SDCs escape
– Quantitative tuning of reliability vs. overhead vs. quality

Challenges and Approach
Determine error outcomes for all application sites
– Requires a complete app resiliency evaluation. How? Naive injection is impractical: too many injections, >1,000 compute-years for one app
– Challenge: analyze all errors with few injections
Cost-effectively convert (some) SDC-causing errors into detections with error detectors
– Challenges: what detectors to use? Where to place them? How to tune?

Relyzer: Application Resiliency Analyzer [ASPLOS'12]
Prune error sites via application-level error (outcome) equivalence:
– Predict the error outcome directly where possible
– Group remaining sites into equivalence classes and inject errors only into one pilot per class
Relyzer can list virtually all SDC-causing instructions

Def to First-Use Equivalence
A fault in the first use of a value is equivalent to a fault in its def ⇒ prune the def:
    r1 = r2 + r3    (def)
    r4 = r1 + r5    (first use)
    …
If there is no first use, then the def is dead ⇒ prune the def

Control Flow Equivalence
Insight: errors flowing through similar control paths may behave similarly
[Figure: CFG in which errors in block X that take the same path behave similarly]
Heuristic: use the direction of the next 5 branches
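A minimal sketch of the branch-direction signature this heuristic implies, assuming the taken/not-taken directions of the dynamic branches following an error site are available from a program trace (the function name is illustrative):

    /* Pack the directions of the next five branches into a small integer;
     * error sites with equal signatures share an equivalence class. */
    #define SIG_BRANCHES 5

    unsigned branch_signature(const int taken[], int n_branches) {
        unsigned sig = 0;
        for (int i = 0; i < SIG_BRANCHES && i < n_branches; i++)
            sig = (sig << 1) | (taken[i] ? 1u : 0u);
        return sig;
    }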

Store Equivalence
Insight: errors in stores may be similar if the stored values are used similarly
Heuristic to determine similar use of values:
– The same number of loads use the value
– The loads are from the same PCs
[Figure: two dynamic instances of a store whose values are read by loads at PC1 and PC2]
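One way to code the store-use comparison, assuming a trace analysis has already recorded, for each dynamic store, the PCs of the loads that read the stored value (the structure and names are assumptions for illustration):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_LOADS 16

    typedef struct {
        uint64_t store_pc;              /* PC of the store itself           */
        size_t   n_loads;               /* loads that read the stored value */
        uint64_t load_pcs[MAX_LOADS];   /* PCs of those loads, in order     */
    } store_use_t;

    /* Two dynamic stores are candidates for the same equivalence class
     * when their values are read by the same number of loads, issued
     * from the same PCs. */
    bool same_store_class(const store_use_t *a, const store_use_t *b) {
        if (a->store_pc != b->store_pc || a->n_loads != b->n_loads)
            return false;
        for (size_t i = 0; i < a->n_loads; i++)
            if (a->load_pcs[i] != b->load_pcs[i])
                return false;
        return true;
    }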

Relyzer
Relyzer: a tool for complete application resiliency analysis
– Employs systematic error pruning using static & dynamic analysis
– Currently lists outcomes as masked, detection, or SDC
3 to 6 orders of magnitude fewer error injections for most apps
– 99.78% of error sites pruned; injections into just 0.004% of sites represent 99% of all sites
– GangES speeds up error simulations even further [ISCA'14]
Can identify virtually all SDC-causing error sites
What about unacceptable vs. acceptable SDCs? Ongoing

Can Relyzer Predict if an SDC is Acceptable?
Pilot = acceptable SDC ⇒ all faults in the class = acceptable SDCs???
Pilot = acceptable SDC with quality Q ⇒ all faults in the class = acceptable SDCs with quality Q???
[Figure: equivalence classes with one pilot each]

PRELIMINARY Results for Utility Validation
Pilot = acceptable SDC with quality Q ⇒ all faults in the class = acceptable SDCs with quality Q???
Studied several quality metrics
– E.g., E = abs(percentage average relative error in output components), capped at 100%; Q = 100 − E (>100% error gives Q = 0, 1% error gives Q = 99)
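A sketch of the example metric in code, under the stated definition; the handling of zero-valued golden outputs is my assumption, since the slide does not specify it:

    #include <math.h>

    /* E = average relative error over output components (percent, capped
     * at 100); Q = 100 - E, so 1% error gives Q = 99, >=100% gives Q = 0. */
    double quality(const double golden[], const double faulty[], int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double rel = (golden[i] != 0.0)
                ? fabs((faulty[i] - golden[i]) / golden[i]) * 100.0
                : (faulty[i] == 0.0 ? 0.0 : 100.0);  /* assumed convention */
            sum += rel;
        }
        double E = sum / n;
        if (E > 100.0) E = 100.0;     /* cap */
        return 100.0 - E;
    }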

Research Strategy (recap)
Towards an application resiliency profile
– For a given instruction, what is the outcome of a fault?
– For now, focus on a transient fault in a single bit of a register
Convert SDCs into (low-cost) detections for full reliability and quality, OR let some acceptable or unacceptable SDCs escape
– Quantitative tuning of reliability vs. overhead vs. quality

SDCs → Detections [DSN'12]
What to protect? SDC-causing fault sites
How to protect? Low-cost detectors
Where to place them? Many errors propagate to few program values
What detectors? Program-level property tests
Uncovered fault sites? Selective instruction-level duplication

Insights for Program-level Detectors
Goal: identify where to place the detectors and what detectors to use
Placement of detectors (where): many errors propagate to few program values ⇒ end of loops and function calls
Detectors (what): test program-level properties – e.g., comparing similar computations and checking value equality

Loop Incrementalization
C code:
    Array a, b;
    for (i = 0; i < n; i++) {
        ...
        a[i] = b[i] + a[i];
        ...
    }
ASM code:
    A = base addr. of a
    B = base addr. of b
    L: load  r1 ← [A]
       ...
       load  r2 ← [B]
       ...
       store r3 → [A]
       ...
       add A = A + 0x8
       add B = B + 0x8
       add i = i + 1
       branch (i < n) L

Loop Incrementalization (cont.)
Where: the adds at the bottom of the loop are SDC-hot app sites; errors from all iterations propagate here in few quantities (A, B, and i)
Detector: collect the initial values of A, B, and i at loop entry and check them at loop exit
No loss in coverage – lossless
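A hedged C rendering of this lossless check; report_detection is a hypothetical error-reporting hook, and the stride check uses element-wise pointer arithmetic, matching the ASM's 0x8-byte increments for 8-byte elements:

    void report_detection(void);   /* hypothetical reporting hook */

    void protected_loop(long *a, long *b, long n) {
        long *A = a, *B = b;
        long i = 0;
        long *A0 = A, *B0 = B;     /* detector setup: record entry values */
        long i0 = i;

        for (; i < n; i++) {       /* original loop body, unmodified */
            *A = *B + *A;
            A++; B++;
        }

        /* Detector check at loop exit: each pointer must have advanced
         * exactly one element per iteration. */
        long iters = i - i0;
        if (A != A0 + iters || B != B0 + iters)
            report_detection();
    }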

Registers with Long Life
Some long-lived registers are prone to SDCs
For detection:
– Duplicate the register value at its definition
– Compare the values at the end of the register's life
No loss in coverage – lossless
[Figure: R1 is defined, copied, used n times, and compared at the end of its lifetime]
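A sketch of the same idea at source level; compute_def, use, and report_detection are hypothetical stand-ins for the defining instruction, the uses, and the reporting hook:

    long compute_def(void);        /* hypothetical: produces the value  */
    void use(long v);              /* hypothetical: consumes the value  */
    void report_detection(void);   /* hypothetical error-reporting hook */

    void protected_live_range(int n_uses) {
        long r1 = compute_def();   /* definition                        */
        long r1_shadow = r1;       /* duplicate the value at definition */

        for (int k = 0; k < n_uses; k++)
            use(r1);               /* uses; r1 is never redefined here  */

        if (r1 != r1_shadow)       /* compare at end of the live range  */
            report_detection();
    }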

Converting SDCs to Detections: Results
Discovered common program properties around most SDC-causing sites
Devised low-cost program-level detectors
– Average SDC reduction of 84%
– Average cost of 10%
New detectors + selective duplication = tunable resiliency at low cost
– Found near-optimal detectors for any SDC target

Identifying Near-Optimal Detectors: Naive Approach
Example: target SDC coverage = 60%
Pick a sample of detectors from the bag and measure coverage with statistical fault injection (SFI):
– Sample 1: 50% SDC coverage at 10% overhead
– Sample 2: 65% SDC coverage at 20% overhead
Tedious and time consuming

Identifying Near-Optimal Detectors: Our Approach
1. Set detector attributes (SDC coverage, overhead), enabled by Relyzer
2. Dynamic programming over the bag of detectors (see the sketch below)
– Constraint: total SDC coverage ≥ 60%
– Objective: minimize overhead
Result: selected detectors with SDC coverage = X% at overhead = Y% (9% for the 60% example)
Obtained SDC coverage vs. performance trade-off curves [DSN'12]
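A minimal knapsack-style DP consistent with this formulation; the additive-coverage assumption and the tenth-of-a-percent discretization are mine, not from the paper:

    #define MAX_COV 1001    /* coverage in tenths of a percent: 0..100.0 */
    #define INF     1e18

    /* cov[d]: detector d's SDC-coverage contribution (tenths of a percent,
     * assumed additive); ovh[d]: its runtime overhead.  Returns the minimum
     * total overhead reaching coverage >= target, or INF if unreachable. */
    double min_overhead(const int cov[], const double ovh[], int n, int target) {
        static double dp[MAX_COV];
        dp[0] = 0.0;
        for (int c = 1; c < MAX_COV; c++) dp[c] = INF;

        for (int d = 0; d < n; d++)             /* each detector used once */
            for (int c = MAX_COV - 1; c >= 0; c--) {
                if (dp[c] >= INF) continue;
                int to = c + cov[d];
                if (to > MAX_COV - 1) to = MAX_COV - 1;  /* saturate */
                if (dp[c] + ovh[d] < dp[to])
                    dp[to] = dp[c] + ovh[d];
            }

        double best = INF;
        for (int c = target; c < MAX_COV; c++)
            if (dp[c] < best) best = dp[c];
        return best;
    }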

SDC Reduction vs. Overhead Trade-off Curve
[Figure: for Relyzer-identified SDCs, our detectors + redundancy reach 90% SDC reduction at 18% overhead and 99% at 24%, consistently below pure redundancy]
Consistently better than pure instruction-level duplication (with Relyzer)
But overhead is still significant for very high resilience
Remove protection overhead in exchange for acceptable (bounded) quality loss?

Understanding Quality Outcomes with Relyzer (Preliminary Results)
Promising potential, but results depend on the quality metric and the application
Can be used to quantitatively tune quality vs. reliability vs. resources

Quality Outcomes – An Instruction-Centric View
Systems operate at a higher granularity than error sites: the instruction?
When is an instruction approximable?
– Which errors in the instruction should result in an acceptable outcome? All? Single-bit errors in all register bits? Single-bit errors in a subset of bits?
– When is an outcome acceptable? Best case: all errors that are masked or produce SDCs are acceptable

So - (Old) Hype or New Frontier?
SWAT inherently implies quality/reliability relaxation, a.k.a. approximate computing
Key: bound the quality/reliability loss (subject to resource constraints)
– Can serve as an enabler for widespread adoption of the resilience approach, especially if automated
Some other open questions:
– Instruction- vs. data-centric?
– Resiliency profiles and analysis at higher granularity?
– How to compose?
– Impact on recovery?
– Other fault models?
Software doesn't ship with 100% test coverage; why should hardware?

Summary
[Figure: reliability vs. overhead (perf., power, area), placing SWAT at very high reliability and low cost, with tunable reliability beyond it]
SWAT symptom detectors: <0.2% SDCs at near-zero cost
Relyzer: systematically find the remaining SDC-causing app sites
Convert SDCs to detections with program-level detectors – tunable reliability vs. overhead
Ongoing: add tunable quality; Relyzer validations are promising