Failover Clustering & Hyper-V: Multisite Disaster Recovery Prakash Gopinadham Support Escalation Engineer Microsoft Corporation
Multi-Site Clustering Content Design guide: http://technet.microsoft.com/en-us/library/dd197430.aspx Deployment guide/checklist: http://technet.microsoft.com/en-us/library/dd197546.aspx Customer case studies using multi-site clustering: http://blogs.msdn.com/b/clustering/archive/2009/11/04/9917628.aspx Setting up a multi-site cluster is easy; the design guide and deployment checklist are available to help.
Multi-Site Clustering Introduction Networking Storage Quorum
Defining High-Availability High-Availability allows applications to maintain service availability by moving them between nodes in a cluster But what if there is a catastrophic event and you lose the entire datacenter? Site A
Defining Disaster Recovery Disaster Recovery (DR) allows applications to maintain service availability by moving them to a cluster node in a different physical location Node is located at a physically separate site Site A Site B SAN
Benefits of a Multi-Site Cluster Protects against loss of an entire location Power Outage, Fires, Hurricanes, Floods, Earthquakes, Terrorism Automates failover Reduced downtime Lower complexity disaster recovery plan What is the primary reason why DR solutions fail? Dependence on People
Multi-Site Clustering Introduction Networking Storage Quorum
Stretching the Network Longer distance traditionally means greater network latency Missed inter-node health checks can cause false failover Cluster heartbeating is fully configurable SameSubnetDelay (default = 1 second) Frequency heartbeats are sent SameSubnetThreshold (default = 5 heartbeats) Missed heartbeats before an interface is considered down CrossSubnetDelay (default = 1 second) Frequency heartbeats are sent to nodes on dissimilar subnets CrossSubnetThreshold (default = 5 heartbeats) Missed heartbeats before an interface is considered down to nodes on dissimilar subnets Command Line: Cluster.exe /prop PowerShell (R2): Get-Cluster | fl *
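For example, a minimal PowerShell sketch for viewing and relaxing the cross-subnet heartbeat settings (assumes the FailoverClusters module on Windows Server 2008 R2; the values shown are illustrative, not recommendations):
Import-Module FailoverClusters
# View the current heartbeat settings (delays are in milliseconds)
Get-Cluster | fl *SubnetDelay, *SubnetThreshold
# Tolerate more latency on the WAN link between sites
(Get-Cluster).CrossSubnetDelay = 2000      # send cross-subnet heartbeats every 2 seconds
(Get-Cluster).CrossSubnetThreshold = 10    # allow 10 missed heartbeats before the interface is considered down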
Security over the WAN Encrypt inter-node communication Trade-off security versus performance 0 = clear text 1 = signed (default) 2 = encrypted 10.10.10.1 20.20.20.1 30.30.30.1 40.40.40.1 Site A Site B
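To raise the security level of inter-node traffic, a minimal PowerShell sketch (assumes the FailoverClusters module; values match the list above):
# Check the current setting (default is 1 = signed)
(Get-Cluster).SecurityLevel
# Encrypt intra-cluster communication, accepting the performance trade-off
(Get-Cluster).SecurityLevel = 2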
Network Considerations Network Deployment Options: Stretch VLANs across sites Cluster nodes can reside in different subnets Public Network 10.10.10.1 20.20.20.1 Site A Site B 30.30.30.1 40.40.40.1 Redundant Network
DNS Considerations Nodes in dissimilar subnets VM obtains new IP address Clients need that new IP Address from DNS to reconnect DNS Replication DNS Server 1 DNS Server 2 Record Created Record Obtained Record Updated Record Updated 10.10.10.111 20.20.20.222 Site A Site B VM = 10.10.10.111 VM = 20.20.20.222
Faster Failover for Multi-Subnet Clusters RegisterAllProvidersIP (default = 0 for FALSE) Determines whether all IP Addresses for a Network Name will be registered in DNS TRUE (1): IP Addresses are registered whether online or offline Ensure the application is set to try all IP Addresses, so clients can reconnect more quickly HostRecordTTL (default = 1200 seconds) Controls how long the DNS record for a cluster network name is cached by clients Shorter TTL: client DNS records are refreshed sooner Exchange Server 2007 recommends a value of five minutes (300 seconds) Get-ClusterResource "Resource Name" | Get-ClusterParameter Get-ClusterResource "Resource Name" | Set-ClusterParameter RegisterAllProvidersIP 1 Cluster.exe res "Resource Name" /priv Cluster.exe res "Resource Name" /priv RegisterAllProvidersIP=1
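A minimal sketch putting both parameters together (assumes the FailoverClusters module; "My Network Name" and the values are examples, and the resource must be cycled for the new settings to take effect):
$res = Get-ClusterResource "My Network Name"   # example: the clustered application's Network Name resource
$res | Set-ClusterParameter -Name RegisterAllProvidersIP -Value 1
$res | Set-ClusterParameter -Name HostRecordTTL -Value 300
# Take the resource offline and back online so the new settings are applied
$res | Stop-ClusterResource
$res | Start-ClusterResource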
Solution #1: Local Failover First Configure local failover first for high availability No change in IP addresses No DNS replication issues No data going over the WAN Cross-site failover for disaster recovery DNS Server 1 10.10.10.111 20.20.20.222 Site A Site B 10.10.10.111
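One way to prefer local failover is to list the local nodes ahead of the remote nodes in the group's preferred owners; a minimal sketch (assumes the FailoverClusters module; the group and node names are examples):
# Prefer the Site A nodes first, then the Site B nodes, for this clustered group
Set-ClusterOwnerNode -Group "VM1" -Owners NodeA1,NodeA2,NodeB1,NodeB2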
Solution #2: Stretch VLANs Deploying a VLAN minimizes client reconnection times IP of the VM never changes DNS Server 2 DNS Server 1 10.10.10.111 VLAN FS = 10.10.10.111 Site A Site B
Solution #3: Abstraction in Networking Device Networking device uses independent 3rd IP Address 3rd IP Address is registered in DNS & used by client 30.30.30.30 DNS Server 2 DNS Server 1 10.10.10.111 20.20.20.222 Site A Site B VM = 30.30.30.30
Multi-Site Clustering Introduction Networking Storage Quorum
Storage in Multi-Site Clusters Different from local clusters: Multiple storage arrays – independent per site Nodes commonly access their own site's storage No ‘true’ shared disk visible to all nodes Site A Site B SAN
Storage Considerations DR requires a data replication mechanism between sites Changes are made on Site A and replicated to Site B Site A Site B SAN Replica
Replication Partners Hardware storage-based replication Block-level replication Software host-based replication File-level replication Appliance replication
Synchronous Replication Host receives “write complete” response from the storage after the data is successfully written on both storage devices Replication Write Request Write Complete Secondary Storage Site A Site B Primary Storage Acknowledgement
Asynchronous Replication Host receives “write complete” response from the storage after the data is successfully written to just the primary storage device, then replication Replication Write Request Write Complete Site A Site B Primary Storage Secondary Storage
Synchronous versus Asynchronous
Synchronous: No data loss | Asynchronous: Potential data loss on hard failures
Synchronous: Requires high bandwidth/low latency connection | Asynchronous: Enough bandwidth to keep up with data replication
Synchronous: Stretches over shorter distances | Asynchronous: Stretches over longer distances
Synchronous: Write latencies impact application performance | Asynchronous: No significant impact on application performance
Cluster Validation with Replicated Storage Multi-Site clusters are not required to pass the Storage tests to be supported Validation Guide and Policy http://go.microsoft.com/fwlink/?LinkID=119949
Challenges of Block Storage Replication Storage block-level replication is typically uni-directional (per LUN) Changed blocks flow from the source to the remote site It is possible to have different LUNs replicating in different directions Storage cannot enforce block-level collision resolution The application must determine resolution, or be coordinated in some way Applications today implement a shared-nothing model Surfacing storage as R/W at multiple sites is only useful if applications can handle a distributed-access device Few applications implement the necessary support The obvious exception is Cluster Shared Volumes for Hyper-V
Multi-Site Clustering Introduction Networking Storage Quorum
Quorum Overview Majority is greater than 50% Possible Voters: Nodes (1 each) + 1 Witness (Disk or File Share) 4 Quorum Types Disk only (not recommended) Node and Disk majority Node majority Node and File Share majority
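To see which quorum model a cluster is currently using, a minimal sketch (assumes the FailoverClusters module):
Get-ClusterQuorum | fl Cluster, QuorumType, QuorumResource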
Replicated Disk Witness A witness is a tie breaker when nodes lose network connectivity The witness disk must be a single decision maker, or problems can occur Do not use a Disk Witness in multi-site clusters unless directed by vendor Replicated Storage
Node Majority 5 Node Cluster: Majority = 3 Cross-site network connectivity broken! Each node asks: Can I communicate with a majority of the nodes in the cluster? Yes: stay up No: drop out of Cluster Membership Site A Site B Majority in Primary Site
Node Majority 5 Node Cluster: Majority = 3 Disaster at Site A (the primary site, holding the majority of nodes): We are down! Surviving nodes ask: Can I communicate with a majority of the nodes in the cluster? No: drop out of Cluster Membership Need to force quorum manually Site A Site B
Forcing Quorum Forcing quorum is a way to manually override and start a node even if the cluster does not have quorum Important: understand why quorum was lost Cluster starts in a special “forced” state Once majority achieved, drops out of “forced” state Command Line: net start clussvc /fixquorum (or /fq) PowerShell (R2): Start-ClusterNode –FixQuorum (or –fq)
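For example, on a surviving node at the secondary site (a minimal sketch; the node name is an example, and forcing quorum should only follow an understanding of why quorum was lost):
Start-ClusterNode -Name NodeB1 -FixQuorum
# Confirm the cluster has formed and see which nodes are up
Get-ClusterNode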
Multi-Site with File Share Witness Site C (branch office) Complete resiliency and automatic recovery from the loss of any 1 site \\Foo\Share WAN Site A Site B
Multi-Site with File Share Witness Site C (branch office) hosts the File Share Witness (\\Foo\Share) Complete resiliency and automatic recovery from the loss of the connection between sites Each node asks: Can I communicate with a majority of the nodes (+ FSW) in the cluster? Yes: stay up No (lock failed): drop out of Cluster Membership Site A Site B WAN
File Share Witness (FSW) Considerations Simple Windows File Server Single file server can serve as a witness for multiple clusters Each cluster requires its own share Can be made highly available on a separate cluster Recommended to be at a 3rd separate site for DR FSW cannot be on a node in the same cluster FSW should not be in a VM running on the same cluster
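Configuring the witness, as a minimal sketch (assumes the FailoverClusters module; \\Foo\Share is the example share from the diagrams, and the cluster computer account needs permissions on it):
Set-ClusterQuorum -NodeAndFileShareMajority "\\Foo\Share"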
Quorum Model Recap
Node and File Share Majority: Even number of nodes; highest availability solution has the FSW in a 3rd site
Node Majority: Odd number of nodes; more nodes in the primary site
Node and Disk Majority: Use as directed by vendor
No Majority: Disk Only: Not Recommended
Session Summary Multi-site Failover Clusters have many benefits You can achieve high availability and disaster recovery in a single solution using Windows Server Failover Clustering Multi-site clusters have additional considerations: Determine network topology across sites Choose a storage replication solution Plan quorum model & nodes
Failover Clustering Resources Design for a Clustered Service or Application in a Multi-Site Failover Cluster http://technet.microsoft.com/en-us/library/dd197430(WS.10).aspx Checklist: Setting Up a Clustered Service or Application in a Multi-Site Failover Cluster http://technet.microsoft.com/en-us/library/dd197546(WS.10).aspx Cluster Information Portal: http://www.microsoft.com/windowsserver2008/en/us/clustering-home.aspx Clustering Technical Resources: http://www.microsoft.com/windowsserver2008/en/us/clustering-resources.aspx Clustering Forum (2008): http://forums.technet.microsoft.com/en-US/winserverClustering/threads/ http://social.technet.microsoft.com/Forums/en-US/windowsserver2008r2highavailability/threads/ R2 Cluster Features: http://technet.microsoft.com/en-us/library/dd443539.aspx
Resources Software Application Developers: http://msdn.microsoft.com/ (msdnindia, @msdnindia) Infrastructure Professionals: http://technet.microsoft.com/ (technetindia, @technetindia)
© 2011 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.