An Empirical Examination of Current High-Availability Clustering Solutions’ Performance Jeffrey Absher DePaul University Research Symposium Presentation.


1 An Empirical Examination of Current High-Availability Clustering Solutions’ Performance
Jeffrey Absher
DePaul University Research Symposium Presentation, November 2003
See the actual paper for bibliographical and procedural information and appropriate academic references.

2 HA and Related Technology
- Distributed OS
- Load Balancing
- Disaster Recovery
- Fault Tolerance
- HA clustering

3 HA’s defining traits
- SPOF avoided by using redundancy
- Single image to the outside world using a single virtual IP address and hostname
- Automated fault management and recovery
- Multiple access paths from each cluster node to each resource group (set of HA services)
- Simple abstraction for applications and administrators
- Undisrupted (or minimally disrupted) services during failover
“If a computer breaks down, the functions performed by that computer will be handled by some other computer in the cluster.”

4 A cluster and tester topology

5
| Event/Failure | What does it simulate? |
| --- | --- |
| Baseline | No events |
| Kill process on primary server | A simple fault that causes an abend to the HA process but does not take out the server. |
| Kill process on primary server and hold the process down for 30 seconds | A core dump that takes a long time or a more complex fault. |
| Kill process on primary, hold down for 30 seconds, and fail to start on second node | A core dump or more complex fault, as well as a misconfiguration on the secondary server. |
| Kill the cluster/watchdog process on the primary server | A bug in the cluster programming that causes an abend or a mistaken shutdown of the cluster processes. |
| Short power failure on primary node | A single-node power failure, technician error, or a loose power cable, etc. |
| Simultaneous power failure on both nodes, primary/secondary recovers first | A datacenter power failure with the two possible recovery orders. |
| For AIX and Linux, loss of serial communication for 60 seconds. For Windows, the Virtual Shared Disk processes were killed and disabled for 60 seconds. | A loose serial cable or technician error such as a cable disconnect, a port misconfiguration, or a mistaken command such as echo hello> /dev/tty0. |
| Primary/secondary server public network loss for 60 seconds | A loose network cable or a technician error such as a cable disconnect, card misconfiguration, or a mistaken command such as ifconfig en0 down. |
| Public/private network down 60 seconds | A power failure on the public hub or MAU, a network storm, or a technician’s error such as a VLAN misconfiguration. |
| IP address clash on public network for 60 seconds | A situation where another machine on the same VLAN is accidentally brought online with an incorrect IP address. |
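As an illustration only, the outage caused by each injected event could be measured from a log of periodic probes against the cluster's virtual IP. The slides do not say how the measurements were taken, so the probe format and the function name `longest_outage` below are assumptions, and the sample data is made up.

```python
# Hypothetical sketch: derive the longest service outage from a probe log.
# Each probe is (timestamp_in_seconds, ok), sorted by time. The format is an
# assumption for illustration, not the tester described in the presentation.

def longest_outage(probes):
    """Return the longest span from a first failed probe to the next success.

    An outage still in progress at the end of the log is not counted.
    """
    longest = 0.0
    down_since = None
    for ts, ok in probes:
        if not ok and down_since is None:
            down_since = ts              # service just went down
        elif ok and down_since is not None:
            longest = max(longest, ts - down_since)
            down_since = None            # service is back
    return longest


# Made-up probe log: one 2-second outage and one 5-second outage.
sample = [(0, True), (1, False), (2, False), (3, True), (4, False), (9, True)]
worst = longest_outage(sample)
```

The resolution of such a measurement is bounded by the probe interval, so short failovers need frequent probes.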

6

7

8

9 Inter-OS Comparison
| | AIX | Win2K | Linux |
| --- | --- | --- | --- |
| Configuration | most difficult | reasonable | simplest |
| Scripting required? | some | none | much |
| Features | many | many | few |
| OS integration | medium | high | low/none |
| Installation | Interdependent | Independent | Independent |
| Trials with HA resulting in a longer outage | 4/14 | 2/14 | 3/14 |
| Trials requiring manual intervention | 0 | 1 | 1 |
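For readers who prefer percentages, the "longer outage" row can be converted with a few lines of arithmetic. The counts come from the slide; the variable names are illustrative.

```python
from fractions import Fraction

# Trials in which HA resulted in a longer outage, as reported on the slide.
longer_outage = {
    "AIX": Fraction(4, 14),
    "Win2K": Fraction(2, 14),
    "Linux": Fraction(3, 14),
}

# Convert each fraction to a percentage rounded to one decimal place.
percentages = {
    os_name: round(float(frac) * 100, 1)
    for os_name, frac in longer_outage.items()
}

for os_name, pct in percentages.items():
    print(f"{os_name}: {pct}% of trials")
```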

10 Subjective Observations
- HA clustering is difficult to configure properly, and the available documentation is lacking.
- Multiple machines must be configured simultaneously; often, packages and software must be installed and configured in a specific order.
- For what should be a loosely coupled system, there are many interdependencies.
- Youn et al. suggest that the design of “administration of clusters…needs improvement.” I agree.
- Vogels et al. state, “Users find it difficult to configure clusters with the desired management … properties. It is difficult to configure applications to be automatically launched in an appropriate order. Lacking solutions to these problems, clusters will remain awkward and time-consuming tools.” I agree.

11 Objective Conclusions Based on Empirical Evidence
- HA is not a perfect solution for every environment, and may be a bad solution for some, depending on the expected faults.
- High failover time for some systems contributes to lower-than-expected performance of HA systems when compared to non-HA systems.
- Failover times need to be significantly smaller than the time required for a reboot or even a restart of a slow-to-start process.
- Primary-node negotiation time at boot contributes to poor performance during power outages.
- There were cases where clustering was shown to actually decrease the uptime of a service or site.
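The point about failover time can be made concrete with a small back-of-the-envelope model. This is a sketch under simplifying assumptions, not the paper's analysis: it treats a non-HA outage as the restart (or reboot) time alone and an HA outage as fault-detection time plus failover time. All numbers below are hypothetical.

```python
def ha_benefit_seconds(restart_s, detect_s, failover_s):
    """Seconds of outage saved by HA for one fault.

    Positive: HA shortens the outage. Negative: HA lengthens it.
    Simplified model: non-HA outage = restart_s,
                      HA outage     = detect_s + failover_s.
    """
    return restart_s - (detect_s + failover_s)


# Hypothetical fault 1: a process that restarts in 20 s on its own, while the
# cluster needs 10 s to detect the fault and 45 s to fail over.
quick_restart = ha_benefit_seconds(restart_s=20, detect_s=10, failover_s=45)

# Hypothetical fault 2: a full reboot taking 300 s, same cluster timings.
slow_reboot = ha_benefit_seconds(restart_s=300, detect_s=10, failover_s=45)

print(quick_restart)  # negative: HA made this outage longer
print(slow_reboot)    # positive: HA made this outage shorter
```

Under this model, HA only pays off for faults whose local recovery would take longer than detection plus failover, which is consistent with the conclusion that the benefit depends on the expected faults.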

12 Q & A

