1 CSSE 377 – Intro to Availability & Reliability Part 2 Steve Chenoweth Tuesday, 9/13/11 Week 2, Day 2 Right – Pictorial view of how to achieve high availability.

Presentation transcript:

1 CSSE 377 – Intro to Availability & Reliability Part 2 Steve Chenoweth Tuesday, 9/13/11 Week 2, Day 2 Right – Pictorial view of how to achieve high availability through duplication of resources. Or is it instead a picture of how not to try using resources for some different activity? From

2 Today
Tactics for software availability engineering…
–Bass’s Ch 5 (pp )
Project 2, part 2 – tonight
Biweekly quiz – second half hour Thursday
HW 2 (individual)

3 Availability Tactics
Try one of these 3 strategies:
–Fault detection
–Fault recovery
–Fault prevention
See next slides for details on each.

4 Fault Detection
Strategy – Recognize when things are going sour:
Ping/echo – Ok – A central monitor checks resource availability
Heartbeat – Ok – The resources report this automatically
Exceptions – Not ok – Someone gets negative reporting (often at low level, then “escalated” if serious)
Right – Everyone likes early fault detection. In hardware systems, multivariate analysis is used to isolate the source of deviations in system performance. From 4&ptid=0.
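To make the heartbeat tactic concrete, here is a minimal sketch of a monitor that records when each resource last reported in and flags any that have gone silent. The class name `HeartbeatMonitor`, the `timeout_s` parameter, and the injectable `clock` are illustrative assumptions, not anything taken from Bass's text.

```python
import time


class HeartbeatMonitor:
    """Hypothetical monitor: resources call beat(); silence past timeout_s is suspicious."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the sketch can be tested without sleeping
        self.last_seen = {}         # resource name -> time of its last heartbeat

    def beat(self, resource):
        """Called by (or on behalf of) a resource to report that it is alive."""
        self.last_seen[resource] = self.clock()

    def suspected_faulty(self):
        """Return the resources whose last heartbeat is older than timeout_s."""
        now = self.clock()
        return sorted(r for r, t in self.last_seen.items()
                      if now - t > self.timeout_s)
```

Ping/echo would invert the direction: the monitor would poll each resource instead of waiting for reports, but the timeout bookkeeping would look much the same.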

5 Fault Recovery - Preparation
Strategy – Plan what to do when things go sour:
Voting – Analyze which is faulty
Active redundancy (hot backup) – Multiple resources with instant switchover
Passive redundancy (warm backup) – Backup needs time to take over a role
Spare – A very cool backup, but lets 1 box back up many different ones
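The voting tactic above can be sketched as a simple majority vote over the outputs of redundant components. The function name `vote` and its return shape are assumptions made for illustration; a real voter would also need to deal with timing and with outputs that are only approximately equal.

```python
from collections import Counter


def vote(outputs):
    """Majority-vote over redundant outputs.

    Returns (agreed_value, dissenting_indices). If no value has a strict
    majority, agreed_value is None and every index is reported as suspect.
    """
    if not outputs:
        raise ValueError("need at least one output to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    dissenters = [i for i, v in enumerate(outputs) if v != value]
    return value, dissenters
```

For example, `vote([42, 42, 41])` accepts 42 and points at the third component as the one to investigate.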

6 Fault Recovery - Reintroduction
Strategy – Do the recovery of a failed component - carefully:
Shadow operation – Watch it closely as it comes back up, let it “pretend” to operate
State resynchronization – Restore missing data – Often a big problem!
–Special mode to resynch before it goes “live”
–Problem of multiple machines with partial data
Checkpoint/rollback – Verify it’s in a consistent state
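As a rough illustration of checkpoint/rollback, the sketch below keeps a deep copy of the last known-consistent state and restores it on demand. `CheckpointedState` is a hypothetical name, and the state is kept in memory only; a real system would persist checkpoints so they survive a crash.

```python
import copy


class CheckpointedState:
    """Hypothetical holder for mutable state with one saved checkpoint."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)   # initial state counts as consistent

    def checkpoint(self):
        """Record the current state as the last known-consistent one."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Discard everything since the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)
```

The deep copies matter: without them, later mutations of `state` would silently corrupt the checkpoint as well.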

7 Fault Prevention (in book)
Runtime Strategy – Don’t even let it happen!
Removal from service – Other components decide to take one out of service if it’s “close to failure”
Transactions – Ensure consistency across servers. “ACID” model* is:
–Atomicity
–Consistency
–Isolation
–Durability
Process monitor – Make a new instance (like of a process)
*ACID Model - See for example
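A small sketch of the atomicity part of ACID, using Python's built-in `sqlite3` module on a single in-memory database (so it does not show consistency across servers). The `accounts` table, the `transfer` helper, and the `balances` helper are made up for this example. The credit is applied first so that a failing debit has something to undo.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount from src to dst atomically. True if committed, False if rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes the block
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit above was undone too.
        return False


def balances(conn):
    return conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts ("
             "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()
```

Calling `transfer(conn, "a", "b", 150)` fails on the debit, and afterwards `balances(conn)` still shows the original amounts: either both updates happen or neither does.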

8 Fault Prevention (not in book)
Construction Strategy – Spend time on the software that’s most critical to availability.
Let’s assume you have a fixed amount of time for developing the software. Divide the components into 3 classes:
–Gold – The top feature, starting the system, backup & recovery, software needed for testing, …
–Silver – Other key features
–Bronze – Everything else
Spend almost all your time achieving quality on the Gold!

9 Hardware basics
Know your availability model! But which one do you really have?
Two components in series (both must be up): $A = a_1 a_2$
Two components in parallel (either one suffices): $A = 1 - (1 - a_1)(1 - a_2)$
Three components in parallel: $A = 1 - (1 - a_1)(1 - a_2)(1 - a_3)$
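The availability models on this slide are easy to compare numerically. The sketch below assumes component failures are independent, which is the assumption behind the formulas themselves; the function names `series` and `parallel` are chosen here for illustration.

```python
from math import prod


def series(*a):
    """All components must be up: A = a1 * a2 * ..."""
    return prod(a)


def parallel(*a):
    """At least one component must be up: A = 1 - (1 - a1)(1 - a2)..."""
    return 1 - prod(1 - x for x in a)
```

With two 99%-available boxes, `series(0.99, 0.99)` comes out near 0.9801 while `parallel(0.99, 0.99)` comes out near 0.9999, which is why it matters which model you really have.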

10 Interesting observations
In duplicated systems, most crashes occur when one part already is down – why?
Most software testing, for a release, is done until the system runs without severe errors for some designated period of time
[Figure: number of failures plotted against time, marking the predicted time when the target is reached]

11 What’s next on Project 2?
Continuing this project,
–Determine the availability of the current system, and
–Implement a tactic to improve it by a designated amount!
And a next step to take today:
Decide on a strategy to test the current availability of the system (or features). Some “stimulator,” etc., that you can build over the weekend.
Pick a tactic which you believe you can implement to improve on it.
–Pick a method from Bass’s Ch5, specifically in the categories of fault detection, recovery or prevention, and be particular about it.
Turn in, in your team journal by 11:55 PM tonight.
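One possible way to put a number on "current availability" is to probe the system at regular intervals and take the fraction of probes that succeeded. This is only a sketch under that assumption; the name `measured_availability` and the idea of a list of pass/fail probe results are not part of the project handout.

```python
def measured_availability(probe_results):
    """Fraction of probes that succeeded, given an iterable of booleans."""
    results = list(probe_results)
    if not results:
        raise ValueError("no probes recorded")
    return sum(1 for ok in results if ok) / len(results)
```

Running the same probes before and after implementing a tactic gives two figures to compare against the designated improvement.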

12 Warning – you’re looking for problems speculatively Not every idea is a good one – just ask Zog from the Far Side…