Presentation transcript:

1 High-availability and disaster recovery
• Dependability concepts:
  • fault-tolerance, high-availability
• High-availability classification
• Types of outages
• Failover
• Replication
• Switchback
• Watchdogs
• Disaster recovery

2 Clusters
• a subclass of distributed systems
• a small-scale, (mostly) homogeneous (the same hardware and OS) array of computers, usually located at one site, dedicated to a small number of well-defined tasks; in solving them the cluster acts as a single whole
• typical tasks for “classic” distributed systems:
  • file services from/to distributed machines across a (college) campus
  • distributing workload to all machines on campus
• typical tasks for a cluster:
  • high-availability web service/file service, other high-availability applications
  • computing “farms”

3 Dependability concepts
• two aspects of dependability
  • reliability: the probability that the system survives until a certain time t
    • mean time to failure (MTTF): the expected life of the system
  • availability: the probability that the system operates correctly at a given point in time
    • mean time to repair (MTTR): how quickly the system is repaired
    • availability = MTTF/(MTTF + MTTR)
• can a system be available but not reliable?
• does higher reliability improve the system’s availability?
• what kinds of systems need to be available? reliable?
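
The availability formula above can be tried out with a few lines of Python. This is only a sketch: the function name and the MTTF/MTTR figures are made up for illustration and do not come from the slides. It shows that a system which fails often but is repaired very quickly can still have higher availability than one which fails rarely but takes long to repair.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR), with both times in the same unit."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR must be non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system A: fails every 10 hours on average, repaired in 3.6 seconds.
frequent_but_fast = availability(mttf_hours=10, mttr_hours=0.001)

# Hypothetical system B: fails every 10,000 hours on average, repaired in 100 hours.
rare_but_slow = availability(mttf_hours=10_000, mttr_hours=100)

print(f"A (low reliability, fast repair):  {frequent_but_fast:.6f}")
print(f"B (high reliability, slow repair): {rare_but_slow:.6f}")
```

With these assumed numbers, A is the less reliable system (shorter MTTF) yet the more available one, because availability depends on the ratio of the two times rather than on MTTF alone.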

4 High-availability scale
• a system is classified by the amount of downtime it allows
  • 1: campus networks
  • 2: usual non-clustered commodity stand-alone machines
  • 3: usual cluster (4 possible)
  • 5: telephone switches
  • 6: in-flight aircraft computers

availability   total accumulated outage per year   class (# of 9s)
90%            more than a month                   0/1
99%            under 4 days                        1/2
99.9%          under 9 hours                       2/3
99.99%         about 1 hour                        3/4
99.999%        over 5 minutes                      4/5
99.9999%       about half a minute                 5/6
99.99999%      about 3 seconds                     6

• a system’s availability is usually expressed as a percentage of uptime or as a class
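
The outage figures on this scale follow directly from the availability percentage. The sketch below recomputes them, assuming a 365-day year; the function names are illustrative and not part of the slides.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def availability_for_nines(nines: int) -> float:
    """Availability with the given number of 9s, e.g. 3 -> 0.999."""
    return 1 - 10 ** (-nines)


def yearly_downtime_seconds(nines: int) -> float:
    """Total accumulated outage per year allowed at the given number of 9s."""
    # The unavailable fraction is 10**-nines; computing it directly avoids
    # the rounding error of evaluating 1 - availability in floating point.
    return SECONDS_PER_YEAR * 10 ** (-nines)


for nines in range(1, 8):
    seconds = yearly_downtime_seconds(nines)
    print(
        f"{nines} nine(s): {availability_for_nines(nines):.5%} available, "
        f"{seconds:,.1f} s/year = {seconds / 3600:,.2f} h/year"
    )
```

Each additional 9 cuts the allowed downtime by a factor of ten, which is why the scale drops from weeks to seconds over only seven steps.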

5 Types of outages, failover
• two types of outages
  • unplanned: caused by faults
  • planned: needed for maintenance of the system (backups, OS upgrades, other upgrades, etc.)
• certain systems need to work reliably only part of the time: stock-exchange computers, in-flight computers
• if the system should be available round the clock, the objective is to minimize both types of outages
• simplest high-availability cluster: backup server with failover
  • failover: the process of transferring control from the failed server to the backup server
  • failback: the process of transferring control back from the backup server to the primary server
• a cluster with failover helps avoid planned as well as unplanned outages
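
Failover and failback can be pictured as a tiny state machine over two servers. The following is a minimal sketch under simplifying assumptions: the class, its method names, and the server names are hypothetical, failure detection is left to the caller, and only failure of the primary triggers an automatic transfer.

```python
class FailoverPair:
    """Two-server cluster: one primary, one backup, one of them active."""

    def __init__(self, primary: str, backup: str) -> None:
        self.primary = primary
        self.backup = backup
        self.active = primary
        self.healthy = {primary: True, backup: True}

    def failover(self) -> None:
        """Transfer control from the primary to the backup server."""
        if not self.healthy[self.backup]:
            raise RuntimeError("backup is down; no server can take over")
        self.active = self.backup

    def failback(self) -> None:
        """Transfer control from the backup back to the primary server."""
        if not self.healthy[self.primary]:
            raise RuntimeError("primary is still down")
        self.active = self.primary

    def mark_failed(self, server: str) -> None:
        """Record a fault; fail over automatically if the active primary failed."""
        self.healthy[server] = False
        if server == self.primary and self.active == self.primary:
            self.failover()

    def mark_repaired(self, server: str) -> None:
        """Record that a server is usable again (does not move control)."""
        self.healthy[server] = True


# Unplanned outage: the primary fails, the backup takes over, then failback.
pair = FailoverPair(primary="server-a", backup="server-b")
pair.mark_failed("server-a")
print("after fault:", pair.active)
pair.mark_repaired("server-a")
pair.failback()
print("after failback:", pair.active)

# Planned outage: fail over deliberately, maintain the primary, fail back.
pair.failover()
print("during maintenance:", pair.active)
pair.failback()
```

The planned case uses the same two operations as the unplanned one, which is how a cluster with failover helps with both types of outages.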

6 Replication and Switchover
• Two types of cluster failover organization:
  • replication (shared-nothing cluster): the backup server keeps its own copy of the data
  • switchover (shared-data cluster): the backup has access to the storage devices used by the primary

Replication
+ easier to add to an existing single machine
+ easier to configure
+ can use any old I/O adapters and controllers
+ can use simple storage units
- 1-to-many backup is hard
- requires another copy of storage
- CPU overhead in normal operation
- synchronization needed
- failback requires additional copying

Switchover
- harder to add
- must modify existing cabling
- harder to configure
- requires specialized I/O devices
- must use hardened storage like RAID
+ 1-to-many backup possible as long as the interconnect allows
+ only one copy of storage used
+ no overhead in normal operation
+ no copying on failback

7 Watchdogs
• a watchdog is a mechanism for notification (and possible correction) of a failure
• simplest (software) watchdog: a process monitoring application processes. If the monitored process fails, the watchdog may take recovery action.
• a watchdog can run
  • on the same machine as the application program: may not be very useful if the machine crashes
  • on a different machine: how is communication carried out?
• the application process may be programmed to cooperate with the watchdog; three ways of cooperating:
  • heartbeat: periodic notification sent to the watchdog by the application process to confirm its correct execution. Alternate heartbeat paths: network, RS-232, SCSI
    • application initiated
    • watchdog initiated
  • idle notification: the application informs the watchdog that it is idle
  • error notification: the application notifies the watchdog that it has encountered an error it cannot correct
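
An application-initiated heartbeat can be sketched as follows. This is an illustration only: the class name, the timeout value, and the recovery callback are assumptions, and the watchdog here lives in the same process as the application, whereas a real one would receive heartbeats over a network or serial path. The clock is injectable so the behaviour can be shown without waiting.

```python
import time
from typing import Callable


class HeartbeatWatchdog:
    """Declares the application failed if no heartbeat arrives within timeout_s."""

    def __init__(
        self,
        timeout_s: float,
        on_failure: Callable[[], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_s = timeout_s
        self.on_failure = on_failure  # recovery action, e.g. restart or failover
        self.clock = clock
        self.last_beat = clock()
        self.failed = False

    def heartbeat(self) -> None:
        """Called periodically by the application to confirm correct execution."""
        self.last_beat = self.clock()
        self.failed = False

    def check(self) -> bool:
        """Return True if the application looks alive; otherwise trigger recovery once."""
        if self.clock() - self.last_beat > self.timeout_s:
            if not self.failed:
                self.failed = True
                self.on_failure()
            return False
        return True


# Demonstration with a fake clock (a one-element list holding the current time).
now = [0.0]
actions = []
dog = HeartbeatWatchdog(3.0, lambda: actions.append("recover"), clock=lambda: now[0])

now[0] = 2.0
dog.heartbeat()            # application reports in at t = 2
now[0] = 4.0
print(dog.check())         # only 2 s since the last heartbeat
now[0] = 6.0
print(dog.check())         # 4 s of silence exceeds the 3 s timeout
print(actions)
```

A watchdog-initiated heartbeat would reverse the direction: the watchdog polls the application and treats a missing reply as the failure signal.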

8 Disaster recovery
• Disaster: a failure that affects large portions of the site or the whole site (fire, flood, storm damage)
• usual recovery technique: resume operations on a system outside the scope of the disaster

tier   description
0      no disaster recovery
1      backups are periodically taken and stored off premises
2      backups are taken to a “hot site” where they can be loaded onto a secondary system if necessary
3      electronic vaulting: a network connects the primary site and the secondary site; backups are transferred over the network
4      active secondary: data is sent over the wire and kept loaded and ready to run on the secondary
5      the secondary is kept completely up to date