PROTECTING THE VIRTUALIZED DATACENTER – HIGH AVAILABILITY Gorazd Šemrov, Microsoft Consulting Services
DATA PROTECTION PLANNING CONSIDERATIONS
What needs protection? Local resources (physical & virtual), remote sites
What are your recovery goals? Prioritize by tier; organizational expectations
Do you have a disaster recovery plan? Downtime, RPO/RTO, testing
How much bandwidth do you have to manage protection?
Is your time better spent on other priorities?
What are your budgetary realities?
HOST CLUSTERING
Cluster service runs inside the (physical) host and manages VMs
VMs move between cluster nodes
Live Migration – no downtime
Quick Migration – session state saved to disk
[Diagram: cluster nodes attached to a SAN]
GUEST CLUSTERING
Cluster service runs inside a VM
Apps and services inside the VM are managed by the cluster
Apps move between clustered VMs
[Diagram: clustered VMs attached to iSCSI storage]
GUEST VS. HOST: HEALTH DETECTION
Fault                      Host Cluster   Guest Cluster
Host Hardware Failure      Yes            Yes
Parent Partition Failure   Yes            Yes
VM Failure                 Yes            Yes
Guest OS Failure           No             Yes
Application Failure        No             Yes
HOST + GUEST CLUSTERING
The optimal solution: offers the most flexibility and protection
VM high availability & mobility between physical nodes
Application & service high availability & mobility between VMs
Increases complexity
[Diagram: guest cluster running on top of a host cluster, with SAN and iSCSI storage]
SETTINGS: ANTIAFFINITYCLASSNAMES
AntiAffinityClassNames (AACN)
Groups with the same AACN try to avoid moving to the same node
http://msdn.microsoft.com/en-us/library/aa369651(VS.85).aspx
Enables VM distribution across host nodes
Better utilization of host OS resources
Failover behavior on large clusters: see the related KB article
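As an illustration, AACN can be set with the tools used throughout this deck; a minimal sketch with a hypothetical group name "SQL-VM1" and class name "SQL":

cluster.exe group "SQL-VM1" /prop AntiAffinityClassNames="SQL"

# PowerShell (R2) equivalent – AntiAffinityClassNames is a string collection:
Import-Module FailoverClusters
$aacn = New-Object System.Collections.Specialized.StringCollection
$aacn.Add("SQL") | Out-Null
(Get-ClusterGroup "SQL-VM1").AntiAffinityClassNames = $aacn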
SETTINGS: AUTO-START
Mark groups as lower priority
Enables the most important VMs to start first
Group property, enabled by default
Disabled VMs need a manual restart to recover after a crash
SETTINGS: PERSISTENT MODE
An HA service or application will return to its original owner
Better VM distribution after a cold start
Enabled by default for VM groups
Disabled by default for other groups
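A minimal sketch of toggling this behavior, assuming the PersistentState group common property (the property name is an assumption based on the cluster API; verify against your version):

# Enable persistent mode for a hypothetical VM group "SQL-VM1"
cluster.exe group "SQL-VM1" /prop PersistentState=1
# PowerShell (R2):
Import-Module FailoverClusters
(Get-ClusterGroup "SQL-VM1").PersistentState = 1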
MULTI-SITE CLUSTERING CONSIDERATIONS Network Compute Quorum Storage
MULTI-SITE CLUSTERS FOR DISASTER RECOVERY
What are multi-site clusters? A single cluster solution stretched over metropolitan-or-greater distances to protect against datacenter failures
Nodes are located at physically separate sites (Site A and Site B)
MULTI-SITE CLUSTERING CONSIDERATIONS Network Compute Quorum Storage
STORAGE DEPLOYMENT OPTIONS
Traditional Cluster Storage
Shared-nothing storage model
Unit of failover is at the LUN/disk level
Ideal for Hyper-V Quick Migration scenarios
STORAGE DEPLOYMENT OPTIONS
Cluster Shared Volumes (CSV)
Multiple nodes can concurrently access the same disk
Unit of failover is at the VM level
Ideal for Hyper-V Quick and Live Migration
Check vendor support
REPLICATION METHOD: SYNCHRONOUS
Host receives a "write complete" response from the storage only after the data is successfully written on both storage devices
1. Write request (host → primary storage)
2. Replication (primary → secondary storage)
3. Acknowledgement (secondary → primary)
4. Write complete (primary storage → host)
REPLICATION METHOD: ASYNCHRONOUS
Host receives a "write complete" response from the storage as soon as the data is written to the primary storage device; replication to the secondary happens afterwards
1. Write request (host → primary storage)
2. Write complete (primary storage → host)
3. Replication (primary → secondary storage)
COMPARING DATA REPLICATION METHODS
Synchronous: RPO = 0 (high business impact, critical applications); for applications not sensitive to high I/O latency; distance between sites 50 km to 300 km; high bandwidth cost
Asynchronous: RPO > 0 (medium-to-low business impact, critical applications); for applications sensitive to high I/O latency; distance between sites > 200 km; mid-to-low bandwidth cost
CLUSTER VALIDATION WITH REPLICATED STORAGE
Multi-site clusters are not required to pass the Storage tests to be supported
See: Validation Guide and Policy
ASYMMETRICAL STORAGE SUPPORT IN SP1
Improves the multi-site cluster experience
Storage visible to only a subset of nodes
Storage topology used for smart placement: workloads are placed based on their underlying storage connectivity
Example: disk set #1 is visible on N1 & N2 and disk set #2 on N3 & N4, so SQL and non-SQL workloads stay separated
CHOOSING A STRETCHED STORAGE MODEL
Support depends on the combination of replication type with Traditional Cluster Storage, Cluster Shared Volumes, and Live Migration:
Hardware replication – consult vendor
Software replication
Appliance replication – consult vendor
MULTI-SITE CLUSTERING CONSIDERATIONS Network Compute Quorum Storage
NETWORK DEPLOYMENT OPTIONS
Option 1: Stretched VLANs – the same public and redundant networks span Site A and Site B
NETWORK DEPLOYMENT OPTIONS
Option 2: Different subnets – each site has its own subnets on the public and redundant networks
CHALLENGES WITH STRETCHED NETWORKS
STRETCHING THE NETWORK
Clustering has no distance limitations (although 3rd-party plug-ins may)
Longer distance traditionally means greater network latency; missed inter-node health checks can cause false failover
Cluster heartbeating is fully configurable:
SameSubnetDelay (default = 1 second) – how frequently heartbeats are sent
SameSubnetThreshold (default = 5 heartbeats) – missed heartbeats before an interface is considered down
CrossSubnetDelay (default = 1 second) – how frequently heartbeats are sent to nodes on different subnets
CrossSubnetThreshold (default = 5 heartbeats) – missed heartbeats before an interface is considered down for nodes on different subnets
Command Line: cluster.exe /prop
PowerShell (R2): Get-Cluster | fl *
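A minimal tuning sketch for a high-latency WAN (the values are illustrative assumptions, not recommendations; delay values are in milliseconds):

Import-Module FailoverClusters
# Inspect the current heartbeat settings
Get-Cluster | fl *subnet*
# Send cross-subnet heartbeats every 2 seconds and tolerate 10 misses
(Get-Cluster).CrossSubnetDelay = 2000
(Get-Cluster).CrossSubnetThreshold = 10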
SECURITY OVER THE WAN
Encrypt inter-node communication
Trade-off: security versus performance
SecurityLevel (default = signed communication)
0 = clear text
1 = signed (default)
2 = encrypted
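A one-line sketch of raising the setting to encrypted inter-node communication:

Import-Module FailoverClusters
# 0 = clear text, 1 = signed (default), 2 = encrypted
(Get-Cluster).SecurityLevel = 2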
UPDATING VM IP ADDRESSES ON FAILOVER
Not needed if both sites are on the same subnet
On cross-subnet failover, it depends on the guest:
DHCP – IP address is updated automatically
Static IP – a new IP address must be configured after failover; this can be scripted
If using multiple subnets, it is easier to use DHCP in the guest OS
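A sketch of what such a script inside the guest might do, assuming a hypothetical interface name and addresses (adjust to your environment):

# Assign the Site B address after a cross-subnet failover (illustrative values)
netsh interface ip set address "Local Area Connection" static 10.0.2.50 255.255.255.0 10.0.2.1 1
# Refresh the VM's DNS record so clients can find the new address
ipconfig /registerdns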
DNS CONSIDERATIONS
With nodes in different subnets, the VM obtains a new IP address on failover
Clients need that new IP address from DNS to reconnect
Flow: the VM's record is created at DNS Server 1 in Site A, updated after failover to Site B, replicated between the DNS servers, and finally obtained by the client
SOLUTION #1: LOCAL FAILOVER FIRST
Configure local failover first for high availability
No change in IP addresses
No DNS replication issues
No data going over the WAN
Use cross-site failover for disaster recovery
SOLUTION #2: STRETCH VLANS
Deploying a VLAN minimizes client reconnection times
The IP address of the VM never changes
SOLUTION #3: NETWORKING DEVICE ABSTRACTION
A networking device presents an independent third IP address
That third IP address is registered in DNS and used by clients
CSV NETWORKING CONSIDERATIONS
Cluster Shared Volumes requires all nodes to be in the same subnet
Use a VLAN on your CSV network
Other networks can still span multiple subnets
LIVE MIGRATING ACROSS SITES
Live migration moves a running VM between cluster nodes
TCP reconnects make the move unnoticeable to clients
Use VLANs to achieve live migration between sites, so the IP address connecting the client to the VM does not change
Network bandwidth planning:
Live migration may require significant network bandwidth, proportional to the amount of memory allocated to the VM
Live migration times will be longer over high-latency or low-bandwidth WAN connections
Remember that CSV and live migration are independent, but complementary, technologies
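A rough back-of-the-envelope illustration (assumed numbers, ignoring dirty-page re-copying and protocol overhead): a VM with 8 GB of RAM is roughly 64 Gb of data, so over a dedicated 1 Gbps link the initial memory copy alone takes on the order of 64 seconds, while over a 100 Mbps WAN the same copy stretches past 10 minutes – which is why WAN bandwidth planning matters.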
MULTI-SUBNET VS. VLAN RECAP
MULTI-SITE CLUSTERING CONSIDERATIONS Network Compute Quorum Storage
QUORUM DEPLOYMENT OPTIONS
1. Disk only
2. Node Majority
3. Node & Disk Majority
4. File Share Witness
REPLICATED DISK WITNESS
A witness is a tie breaker when nodes lose network connectivity
When the witness is replicated, it is no longer a single decision maker, and problems occur
Do not use a replicated disk witness in multi-site clusters unless directed by your vendor
NODE MAJORITY
Cross-site network connectivity broken!
Each node asks: can I communicate with a majority of the nodes in the cluster?
Nodes in Site A (majority in primary site): yes – stay up
Nodes in Site B: no – drop out of cluster membership
5-node cluster: majority = 3
NODE MAJORITY
Disaster at Site A!
The surviving Site B nodes ask: can I communicate with a majority of the nodes in the cluster?
No – they drop out of cluster membership (5-node cluster: majority = 3, and the majority was in the primary site)
Quorum must be forced manually
FORCING QUORUM
Forcing quorum is a way to manually override and start a node even though it has not achieved quorum
Always understand why quorum was lost
Used to bring the cluster online without quorum
The cluster starts in a special "forced" state; once majority is achieved, it drops out of the "forced" state
Command Line: net start clussvc /fixquorum (or /fq)
PowerShell (R2): Start-ClusterNode –FixQuorum (or –fq)
MULTI-SITE WITH FILE SHARE WITNESS
A file share witness (e.g. \\Foo\Share) placed in a third site (branch office) over the WAN
Complete resiliency and automatic recovery from the loss of any one site
CHANGES IN SERVICE PACK 1
Node Vote Weight
Granular control of which nodes have votes in determining quorum
Flexibility for multi-site clusters: prefer the primary site during a network split
Complete failure of the backup site will not bring down the cluster
Command Line: cluster.exe node <NodeName> /prop NodeWeight=0
PowerShell (R2): (Get-ClusterNode "NodeName").NodeWeight = 0
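A minimal sketch of applying this to the primary/backup split above (the backup-site node names B3 and B4 are assumptions carried over from the next slide):

Import-Module FailoverClusters
# Remove the backup-site votes so the primary site wins a network split
(Get-ClusterNode "B3").NodeWeight = 0
(Get-ClusterNode "B4").NodeWeight = 0
# Verify the resulting vote distribution
Get-ClusterNode | ft Name, NodeWeight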
CHANGES IN SERVICE PACK 1
Prevent Quorum
The admin started the backup site with the /ForceQuorum option
If the primary site is then restarted normally, N1 & N2 overwrite the authoritative cluster configuration, and changes made by B3 & B4 are lost
If the primary site is instead started with Prevent Quorum (recommended), the quorum override is avoided, changes made by B3 & B4 are maintained, and N1 & N2 gracefully join the existing membership
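The corresponding start-up commands, mirroring the /fixquorum syntax shown earlier (option names as documented for 2008 R2 SP1; verify on your build):

Command Line: net start clussvc /preventquorum (or /pq)
PowerShell (R2): Start-ClusterNode –PreventQuorum (or –pq)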
QUORUM MODEL RECAP
Node and File Share Majority – even number of nodes; best availability solution, with the FSW in a 3rd site
Node Majority – odd number of nodes; keep more nodes in the primary site
Node and Disk Majority – use as directed by vendor
No Majority: Disk Only – not recommended; use as directed by vendor
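As an illustration, the quorum model can be set with the FailoverClusters module; a minimal sketch reusing the \\Foo\Share witness path from the earlier slide:

Import-Module FailoverClusters
# Even node count spread across two sites: node + file share majority, FSW in a 3rd site
Set-ClusterQuorum -NodeAndFileShareMajority "\\Foo\Share"
# Odd node count: plain node majority
# Set-ClusterQuorum -NodeMajority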
MULTI-SITE CLUSTERING CONTENT
Design guide:
Deployment guide/checklist:
QUESTIONS?
After the session, please fill out the evaluation form. The forms will be sent to your e-mail address and will also be available through your profile on the conference web portal. By completing it you help improve the conference. Thank you!