Scott Mahlke University of Michigan

Presentation transcript:

Top 5 Reasons Reliability is the Biggest Fallacy in Computer Architecture Research
Scott Mahlke, University of Michigan
Thanks to Jason Blome, Shuguang Feng, and Shantanu Gupta for putting their research on reliable systems on hold to help with this presentation.

Disclaimer
Still a need for high reliability designs for mission critical systems
  Space shuttle, airplanes, etc.
  Cost is not an issue – use high degrees of redundancy
Legacy code was not developed to give tools deep knowledge about the computation being performed. E.g., source code does not express high-level properties and assumptions.
The question is how to go from parallel algorithms to parallel machine code.
I would like to convince you reliability is a fallacy for mainstream computer systems used in consumer/business electronics*
*The speaker may not agree with this position

Reason 1: It’s the Software, Stupid!
Reliability of software with a large consumer base (like Windows) has not shown improvement.
The failures per billion hours of operation in Windows are an order of magnitude higher than the corresponding value for hardware components.
In the data shown here, the software errors are mostly transient in nature, while the hardware errors are permanent.
“Mature OS can have an MTTF measured in months, while newer OS may crash every few days.” – Peter Chen, Reliability Hierarchies, HotOS 1999
Sources:
[1] www.nstl.com
[2] A system-level approach for memory robustness, ICMTD05
[3] Lifetime Reliability: Towards an architectural solution, IEEE Micro 2005
[4] www.calce.umd.edu
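The slide mixes two units: failures per billion hours (commonly called FIT) and MTTF in days or months. A minimal sketch of the conversion follows, assuming a constant failure rate so that MTTF = 10^9 / FIT hours. The function names and the 3-day and 90-day inputs are illustrative assumptions, not data from the presentation.

```python
"""Convert between FIT (failures per billion hours) and MTTF.

Assumes a constant failure rate, so MTTF_hours = 1e9 / FIT.
The inputs used under __main__ are hypothetical, not slide data.
"""

HOURS_PER_BILLION = 1e9
HOURS_PER_DAY = 24.0


def mttf_days_to_fit(mttf_days: float) -> float:
    """Return failures per billion hours for an MTTF given in days."""
    if mttf_days <= 0:
        raise ValueError("MTTF must be positive")
    return HOURS_PER_BILLION / (mttf_days * HOURS_PER_DAY)


def fit_to_mttf_days(fit: float) -> float:
    """Return the MTTF in days for a failure rate given in FIT."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return HOURS_PER_BILLION / fit / HOURS_PER_DAY


if __name__ == "__main__":
    # Hypothetical cases loosely matching "every few days" and "months".
    for label, days in [("crash every 3 days", 3.0), ("90-day MTTF", 90.0)]:
        print(f"{label}: {mttf_days_to_fit(days):,.0f} FIT")
```

Under this assumption, a 90-day MTTF corresponds to roughly 463,000 FIT, which gives a sense of scale when comparing a software failure rate against a hardware FIT figure.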

Hmm… My ATM Does Not Work

Reason 2: Disposable Electronics
“The average working life of a mobile phone is 7 years, but the average consumer changes their mobile every 11 months.”

PCs/Laptops Not Far Behind
“Take-away something.”
http://ieeexplore.ieee.org/iel5/9100/28876/01299720.pdf

Reason 3: A Transient Fault is About As Likely As …

Reason 4: Does Anyone Care?
Which is flawed?
Can a human identify errors in video, images, or sound?
Glitches are accepted by the consumer (dropped cell calls)
Natural redundancy and resiliency in software
100% reliable operation of hardware is not important or worth extra cost in many situations

Reason 5: This Problem is Better Solved Closer to the Circuit Level
[Figures: intra-die variations in ILD thickness; electromigration in copper; RAZOR FF diagram (labels: D1, Q1, clk, clk_del, Main Flip-Flop, Shadow Latch, comparator, Error, Error_L)]
Lower overhead
Many designs benefit
In-situ solutions naturally handle variation
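The Razor flip-flop in the figure is the kind of in-situ circuit technique the slide refers to. As commonly described, a main flip-flop samples the data on clk, a shadow latch samples it again on the delayed clock clk_del, and a comparator raises an error when the two disagree so the shadow value can be restored. The following is a behavioral sketch only, assuming an idealized single 0-to-1 or 1-to-0 transition and ignoring setup/hold and metastability. The names `razor_sample`, `value_at`, and `RazorResult` are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class RazorResult:
    q: int        # value held after recovery (taken from the shadow latch)
    error: bool   # True when main flip-flop and shadow latch disagreed


def value_at(t: float, arrival: float, old: int, new: int) -> int:
    """Idealized data line: holds `old` until `arrival`, then `new`."""
    return new if t >= arrival else old


def razor_sample(arrival: float, old: int, new: int,
                 clk_edge: float, clk_del: float) -> RazorResult:
    """Sample at the clock edge and again clk_del later, then compare."""
    main = value_at(clk_edge, arrival, old, new)
    shadow = value_at(clk_edge + clk_del, arrival, old, new)
    # A mismatch means the data arrived after the main clock edge but
    # before the delayed edge: a detectable timing error.
    return RazorResult(q=shadow, error=(main != shadow))


if __name__ == "__main__":
    # Hypothetical timings in arbitrary units: edge at 1.0, delay 0.3.
    for arrival in (0.8, 1.1, 1.5):
        print(arrival, razor_sample(arrival, old=0, new=1,
                                    clk_edge=1.0, clk_del=0.3))
```

In this sketch, data arriving at 1.1 is caught and corrected, while data arriving at 1.5 misses the shadow latch too and goes undetected. That mirrors the usual design constraint that the worst-case path must settle before the delayed clock edge.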

Some Hope?
What if we assume reliability is a looming problem? Then we need solutions that are:
Low overhead, with a high rate of return
Joint circuit/architectural techniques
Domain specific – know thy customer
Reliability features provide other benefits
  It’s not just a tax
The bottom line