Slide 1 Recovery-Oriented Computing Aaron Brown, Dan Hettenna, David Oppenheimer, Noah Treuhaft, Leonard Chung, Patty Enriquez, Susan Housand, Archana Ganapathi, Dan Patterson, Jon Kuroda, Mike Howard, Matthew Mertzbacher, Dave Patterson, and Kathy Yelick University of California at Berkeley In cooperation with George Candea, James Cutler, and Armando Fox Stanford University

Slide 2 Agenda
Schedule Wednesday
1:00 Intro, ROC talks
3:00 Break
3:30 OceanStore
6:00 Dinner
7:30 Panel Session: “Challenges and Myths in Operating Reliable Internet Services”
Wireless Internet access during breaks (no access during talks/panels)
–Network name is “nest”

Slide 3 Target is Services
Companies like Amazon, Google, Yahoo, …
Also internal ASP model
–Enterprise IT as services
Since providing a single service, can do things differently
–Fascinating solutions to hard problems
–Change software while continually providing service
Since providing a service, availability is the killer metric
Plausible model for future of IT?

Slide 4 The past: goals and assumptions of last 15 years
Goal #1: Improve performance
Goal #2: Improve performance
Goal #3: Improve cost-performance
Assumptions
–Humans are perfect (they don’t make mistakes during installation, wiring, upgrade, maintenance, or repair)
–Software will eventually be bug-free (good programmers write bug-free code, debugging works)
–Hardware MTBF is already very large (~100 years between failures), and will continue to increase

Slide 5 Today, after 15 years of improving performance
Availability is now the vital metric for servers
–near-100% availability is becoming mandatory
»for e-commerce, enterprise apps, online services, ISPs
–but, service outages are frequent
»65% of IT managers report that their websites were unavailable to customers over a 6-month period
»25%: 3 or more outages
–outage costs are high
»social effects: negative press, loss of customers who “click over” to competitor
»$500,000 to $5,000,000 per hour in lost revenues
Source: InternetWeek 4/3/2000

Slide 6 New goals: ACME
Availability
–24x7 delivery of service to users
Change
–support rapid deployment of new software, apps, UI
Maintainability
–reduce burden on system administrators (cost of ownership ~5X cost of purchase)
–provide helpful, forgiving sysadmin environments
Evolutionary Growth
–allow easy system expansion over time without sacrificing availability or maintainability

Slide 7 Where does ACME stand today?
Availability: failures are common
–Traditional fault-tolerance doesn’t solve the problems
Change
–In back-end system tiers, software upgrades are difficult, failure-prone, or ignored
–For application service over WWW, daily change
Maintainability
–human operator error is the single largest failure source?
–system maintenance environments are unforgiving
Evolutionary growth
–1U-PC cluster front-ends scale and evolve well
–back-end scalability still limited

Slide 8 Recovery-Oriented Computing Philosophy
“If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time” — Shimon Peres
Failures are a fact, and recovery/repair is how we cope with them
Improving recovery/repair improves availability
–UnAvailability = MTTR / MTTF (worked example below)
–1/10th MTTR is just as valuable as 10X MTBF (assuming MTTR much less than MTTF)
If the major sys admin job is recovery after failure, ROC also helps with sys admin
If necessary, start with a clean slate, sacrificing disk space and performance for ACME
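A quick back-of-the-envelope check of the UnAvailability = MTTR / MTTF relation above (a minimal sketch in Python; the MTTR and MTTF figures are illustrative values, not measurements from the project):

```python
# UnAvailability ~= MTTR / MTTF (valid when MTTR is much less than MTTF).
def unavailability(mttr_hours, mttf_hours):
    return mttr_hours / mttf_hours

baseline       = unavailability(mttr_hours=1.0, mttf_hours=1000.0)   # 1e-3
faster_repair  = unavailability(mttr_hours=0.1, mttf_hours=1000.0)   # 1e-4
fewer_failures = unavailability(mttr_hours=1.0, mttf_hours=10000.0)  # 1e-4

# Cutting MTTR by 10x buys the same unavailability as raising MTTF by 10x.
assert abs(faster_repair - fewer_failures) < 1e-12
```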

Slide 9 ROC Approach
Work with companies to get real data on failure causes and patterns
–David Oppenheimer’s survey of 3 sites
–Patty Enriquez’s survey of FCC switch failure data
Develop ACME benchmarks to test old systems & new ideas to measure improvement (a sketch of a recovery-time measurement follows this slide)
–Fastest to Recover from Failures vs. Fastest on SPEC
–Poster session: database benchmark (Chang & Brown)
–Friday: Fault Insertion in glibc (CS 294/444A class)
–Friday: Automating Root Cause Analysis (294/444A)
Support for humans to operate services
–Aaron Brown’s talk on Undo for SysAdmin
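As a hedged illustration of the “fastest to recover” benchmarking idea above, here is a minimal sketch of a recovery-time measurement loop; inject_fault and probe_service are hypothetical hooks supplied by the benchmark author, not part of the actual ROC benchmark code:

```python
import time

def time_to_recover(inject_fault, probe_service, poll_interval=0.5, timeout=300.0):
    """Inject a fault, then measure how long the service takes to answer correctly again."""
    inject_fault()                       # e.g. kill a process or corrupt a library call
    start = time.time()
    while time.time() - start < timeout:
        if probe_service():              # e.g. an HTTP request that must return a valid page
            return time.time() - start   # recovery time in seconds
        time.sleep(poll_interval)
    raise TimeoutError("service did not recover within the timeout")
```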

Slide 10 ROC Approach
Cluster technology that enables partitioning systems, inserting faults, and testing outputs
ISTORE (ROC-I): Cluster of 64 PCs modified with ability for HW isolation, fault insertion, monitoring, diagnostic processors
Cluster of 40 IBM PCs: each with 2 GB DRAM and 1 gigabit Ethernet; gigabit switch; HW monitor; each running the VMware virtual machine monitor (software layer)

Slide 11 ISTORE HW update
Difficulties with dual power supplies
–power drops during startup, flaky power-up of dual power supplies
–led to rework and a 2nd revision of the backplane
–continued problems getting both power supplies to work together
–decision to move on using one of the two power supplies for now
Now testing bricks before fabricating the remainder of the backplanes.

Slide 12 ISTORE Software Update
Development for diagnostic processors (DPs)
–Sensor library: software API to access temperature, vibration, humidity, and other sensors
–DP network protocol: reliable connection-based protocol over CAN-bus hardware
–Remote logging interface: can log system events from a brick on a remote PC (sketch after this slide)
»Useful for debugging and for sensor data analysis
–Brick coordination protocol: synchronization and coordination between bricks, used for
»Power-up phase, to avoid power surge
»Accessing devices shared by a shelf on the backplane
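To make the remote logging interface concrete, here is a minimal sketch of a brick shipping a timestamped sensor event to a logging PC; the JSON-over-TCP format, port, and host name are illustrative assumptions, not the actual DP protocol (which runs over CAN-bus):

```python
import json, socket, time

def log_event(host, port, brick_id, sensor, value):
    """Send one timestamped sensor reading to a remote log collector."""
    event = {"time": time.time(), "brick": brick_id, "sensor": sensor, "value": value}
    with socket.create_connection((host, port)) as conn:
        conn.sendall((json.dumps(event) + "\n").encode())

# Example (hypothetical collector): report a temperature reading from brick 17.
# log_event("logger.example.org", 9000, brick_id=17, sensor="temperature_C", value=41.5)
```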

Slide 13 ISTORE Network
Mb Ethernet interfaces/brick:
–Bandwidth through striping (sketch below)
–Failure resistant: striping automatically adjusts for failed links
Two ISTORE programming models
–Cluster with Linux PCs + TCP
»Cluster servers, such as Apache, run
»Problem: Pentium can saturate 2 of 4 links
–As Parallel Java (Titanium) platform
»User-level UDP for lower-overhead communication
»Recently rebuilt to eliminate concurrency problems
»Protection from language model
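A toy sketch of the link-striping behavior described above: spread traffic round-robin across a brick's Ethernet interfaces and skip any link currently marked failed. The interface names and the send_on helper are hypothetical; the real ISTORE implementation sits below the TCP/UDP layer:

```python
import itertools

class StripedLinks:
    """Round-robin over multiple links, skipping links marked as failed."""
    def __init__(self, links):
        self.links = list(links)                 # e.g. ["eth0", "eth1", "eth2", "eth3"]
        self.failed = set()
        self._cycle = itertools.cycle(self.links)

    def mark_failed(self, link):
        self.failed.add(link)

    def next_link(self):
        for _ in range(len(self.links)):         # at most one full pass over the links
            link = next(self._cycle)
            if link not in self.failed:
                return link
        raise RuntimeError("all links have failed")

# Usage:
#   striper = StripedLinks(["eth0", "eth1", "eth2", "eth3"])
#   striper.mark_failed("eth1")
#   send_on(striper.next_link(), packet)         # send_on is a placeholder
```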