Troubleshooting AlwaysOn Availability Groups

Chirag Shah, Premier Field Engineer

© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Prerequisites
Familiarity with setting up and deploying AlwaysOn Availability Groups.
This is NOT a beginner-level session.
Legal disclaimer: the content, ideas, and opinions stated are my own and not my employer's.

Agenda
Why did my availability group fail over?
Why didn't my availability group fail over?
Setting up availability groups:
    Create Availability Group fails with error 'Failed to join the database'
    Create Listener fails with 'The WSFC cluster could not bring the Network Name resource online'

Availability Groups: Various Logs
SQL Server error log(s)
Windows cluster log(s)
Cluster event logs
SQL Server Failover Cluster Instance diagnostic logs (XEvent)
AlwaysOn_health logs (XEvent)
System_health logs (XEvent)
System and application event logs
Lease or Health Check Timeout

Cluster Diagnostic Logs i.e. sp_server_diagnostics results
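10 files, 100 MB each
Files are stored in the SQL Server LOG directory
Stored on the SQL Server instance hosting the current primary replica*
XEvent files that can be opened using Management Studio
You can use the following query to shred that log:

    SELECT
         xml_data.value('(event/@name)[1]','varchar(max)') AS 'Name'
        ,xml_data.value('(event/@package)[1]','varchar(max)') AS 'Package'
        ,xml_data.value('(event/@timestamp)[1]','datetime') AS 'Time'
        ,xml_data.value('(event/data[@name=''state'']/value)[1]','int') AS 'State'
        ,xml_data.value('(event/data[@name=''state_desc'']/value)[1]','varchar(max)') AS 'State Description'
        ,xml_data.value('(event/data[@name=''failure_condition_level'']/value)[1]','int') AS 'Failure Conditions'
        ,xml_data.value('(event/data[@name=''node_name'']/value)[1]','varchar(max)') AS 'Node_Name'
        ,xml_data.value('(event/data[@name=''instancename'']/value)[1]','varchar(max)') AS 'Instance Name'
        ,xml_data.value('(event/data[@name=''creation time'']/value)[1]','datetime') AS 'Creation Time'
        ,xml_data.value('(event/data[@name=''component'']/value)[1]','varchar(max)') AS 'Component'
        ,xml_data.value('(event/data[@name=''data'']/value)[1]','varchar(max)') AS 'Data'
        ,xml_data.value('(event/data[@name=''info'']/value)[1]','varchar(max)') AS 'Info'
    FROM (
        SELECT object_name AS 'event'
              ,CONVERT(xml,event_data) AS 'xml_data'
        FROM sys.fn_xe_file_target_read_file('C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Log\SQLNODE1_MSSQLSERVER_SQLDIAG_0_ xel',NULL,NULL,NULL)
    ) AS XEventData
    ORDER BY Time;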

AlwaysOn Common Customer Scenarios
Why did my availability group fail over?
OR
Why didn't my availability group fail over?

Who initiates automatic failover?
Why did my availability group fail over? The main reasons for AG failover are:
Windows Cluster health detection (network issues, cluster node down)
Lease timeout
Health check timeout
New in SQL 2016/: database status <> 'ONLINE'

Failover Due to Windows Cluster
Windows Cluster detects a heartbeat issue between nodes and fails over the availability group.
Look for event 1135 (Event_NODE_DOWN) in the System event log.
Review the cluster log for entries such as:
    INFO [IM] got event: Remote endpoint xxx.xxx.x.xx:~3343~ unreachable from xxx.xxx.x.xx:~3343~
    INFO [IM] Marking Route from xxx.xxx.x.xx:~3343~ to xxx.xxx.x.xx:~3343~ as down
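When a cluster log is large, a small script can pull out the "unreachable" entries. The following is a minimal Python sketch, assuming the log lines keep the shape shown on this slide; the function name and the sample IP addresses are hypothetical and not part of the original deck.

```python
import re

# Matches cluster log lines of the form quoted on the slide:
#   "... Remote endpoint <ip>:~<port>~ unreachable from <ip>:~<port>~"
UNREACHABLE = re.compile(
    r"Remote endpoint (?P<remote>[\d.]+):~(?P<rport>\d+)~ "
    r"unreachable from (?P<local>[\d.]+):~(?P<lport>\d+)~"
)

def find_unreachable_routes(lines):
    """Return (local_ip, remote_ip) pairs for every 'unreachable' entry."""
    routes = []
    for line in lines:
        match = UNREACHABLE.search(line)
        if match:
            routes.append((match.group("local"), match.group("remote")))
    return routes

# Hypothetical sample lines; real logs carry timestamps and thread ids too.
sample = [
    "INFO [IM] got event: Remote endpoint 10.0.0.12:~3343~ unreachable from 10.0.0.11:~3343~",
    "INFO [IM] Marking Route from 10.0.0.11:~3343~ to 10.0.0.12:~3343~ as down",
]
print(find_unreachable_routes(sample))
```

Feeding it the lines of a file produced by Get-ClusterLog would list which node lost sight of which peer.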

Automatic Failover Detection (Current Primary Replica)
Lease Timeout: Signaling mechanism between the resource DLL (HADRRES.DLL) and SQL Server. The lease is renewed at 1/4 of the LeaseTimeout setting in the cluster, which by default means every 5 seconds.
Health Check Timeout: The AlwaysOn resource DLL (HADRRES.DLL), running in RHS.EXE, has a local ODBC connection to SQL Server and expects to receive sp_server_diagnostics results back within the availability group's HEALTH_CHECK_TIMEOUT property, which is 30 seconds by default.

To find the local "ODBC" connection (host_process_id is the RHS process connecting to SQL Server):

    SELECT s.program_name, s.host_process_id, *
    FROM sys.dm_exec_requests AS r
    INNER JOIN sys.dm_exec_sessions AS s ON r.session_id = s.session_id
    WHERE s.program_name = 'Microsoft® Windows® Operating System'
      AND r.last_wait_type = 'SP_SERVER_DIAGNOSTICS_SLEEP'

Sequence of events:
1. Cluster service sends LooksAlive.
2. sp_server_diagnostics results are returned to the resource DLL.
3. The resource DLL processes the results, detects ERROR, and notifies the cluster service.
4. The cluster service issues Offline to SQL Server.
5. sp_availability_group_command_internal is executed and takes the databases offline.

Concept of Lease – Availability Groups Why?
[Diagram: Server A, Server B, Server C; HR DB; availability group AG_HR; replicas labeled Primary, Primary, Secondary]

Scenario 1: network partition. When a partition happens on a 3-node majority cluster (say A, B, C), assume B and C form one partition and A is alone in the other. B and C form a quorum set, while there is no quorum for A. Since there is no quorum, the cluster calls Terminate for all the resources that are currently online on A and calls Online for these resources on either B or C. The cluster expects the Terminate call to succeed on A. If it doesn't succeed, the availability group remains primary on A, and since the cluster calls Online on either B or C, this leads to a split-brain scenario.

Scenario 2: failed offline during a manual failover. Assume two nodes, A and B, and an availability group that was created with node A as primary and B as secondary. Later, a Failover command is issued to make node B the primary. This involves a MoveGroup operation in the cluster and results in the following sequence of events:
1. As part of the MoveGroup operation, the Offline command is issued on node A.
2. The stored procedure to bring the AG offline doesn't succeed and times out. The failure is reported to the cluster.
3. The cluster invokes the Terminate command because of the failed Offline.
4. As part of Terminate, the offline command is retried and times out again. Since failure cannot be reported to the cluster during a Terminate call, the call simply returns at this point.
5. As Terminate returned on node A, the cluster assumes the AG is offline on node A.
6. As the second part of MoveGroup, the cluster calls Online on node B. The resource DLL calls the stored procedure to bring the AG online. This operation succeeds and SQL Server reports SUCCESS to the resource DLL.
7. SQL Server on node B assumes it is the primary. The resource DLL reports success to the cluster, and the cluster marks node B as owner of the availability group.

Note that the resource DLL on node B doesn't have any knowledge of node A. It simply follows the instructions from the cluster.

Availability Group "Lease"
The "lease" is maintained between the Windows cluster and the primary replica hosting the availability group.
If you have multiple AGs, is there a separate "lease" for each?
Ensures SQL Server is responsive.
The "lease" provides an additional protection mechanism to avoid a split-brain condition.
Two-way handshake.
Uses a preemptive thread which runs at priority (not a SQL Server worker thread).

To find the local "ODBC" connection (host_process_id is the RHS process connecting to SQL Server):

    SELECT s.program_name, s.host_process_id, s.client_interface_name, s.login_name, *
    FROM sys.dm_exec_requests AS r
    INNER JOIN sys.dm_exec_sessions AS s ON r.session_id = s.session_id
    WHERE s.program_name = 'Microsoft® Windows® Operating System'
      AND r.last_wait_type = 'SP_SERVER_DIAGNOSTICS_SLEEP'

Availability Group "Lease Timeout"
The lease is renewed every ¼ of the lease timeout interval, which is about every 5 seconds in a default configuration.
If the lease is not renewed for more than 20 seconds, HADRRES.dll (part of RHS.exe) reports an error to the Windows cluster, and the cluster proceeds with corrective action.

What causes a lease timeout:
Overall system performance degradation, e.g. working set trim.
SQL Server generating a memory dump, e.g. for an access violation or deadlocked schedulers.
100 percent CPU utilization for a sustained period of time.

Demo Availability Groups Lease Timeout

Lease Timeout Error Numbers and Messages

Error 19407
Message: The lease between availability group <ag> and the Windows Server Failover Cluster has expired.
Cause: Generic lease timeout message. Still accompanied by other messages.
Corrective action: Check for system performance degradation and SQL Server dumps; review diagnostics.

Error 19419
Message: The renewal of the lease between availability group '%.*ls' and the Windows Server Failover Cluster failed because the existing lease is no longer valid.
Cause: The lease worker on the SQL Server side did not get scheduled on time to process the event signal from the cluster.
Corrective action: Lease timeouts need investigation on the SQL Server side.

Error 19421
Message: The renewal of the lease between availability group '%.*ls' and the Windows Server Failover Cluster failed because renewal didn't happen within lease interval.
Cause: The lease helper on the cluster side did not signal the SQL Server lease worker on time.
Corrective action: Check the corresponding availability group resource in the WSFC cluster to see if it reported any error.

Error 19422
Message: The renewal of the lease between availability group '%.*ls' and the Windows Server Failover Cluster failed because of a windows error with Error code ('%d').
Cause: The lease worker on the SQL Server side failed to renew the lease because of a Windows error.
Corrective action: Check the Windows error code and take corrective action.

The stored procedure sp_availability_group_command_internal is executed over an ODBC connection to inform SQL Server about Online and Offline operations. This can fail for a variety of reasons, for example network issues or worker thread unavailability in SQL Server to process the command.

Lease Timeout: The default is 20 seconds, and lease renewal occurs every ¼ of that timeout value.
Health Check Timeout: The default is 30 seconds, and health detection through sp_server_diagnostics occurs every 1/3 of that timeout value.
Failure Condition Level: Levels range from 1 (high level) to 5 (very granular). A given level encompasses all lower levels, so the default of 3 also includes levels 1 and 2.

The WSFC resource DLL of the availability group performs a health check of the primary replica by calling the sp_server_diagnostics stored procedure on the instance of SQL Server that hosts the primary replica. sp_server_diagnostics returns results at an interval equal to 1/3 of the health-check timeout threshold for the availability group. The default health-check timeout threshold is 30 seconds, which causes sp_server_diagnostics to return at a 10-second interval. If sp_server_diagnostics is slow or is not returning information, the resource DLL waits for the full health-check timeout threshold before determining that the primary replica is unresponsive. If the primary replica is unresponsive, an automatic failover is initiated, if currently supported.
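The arithmetic behind these intervals can be sketched in a few lines. This is only an illustration of the ¼ and 1/3 rules stated above, with timeouts expressed in seconds as on the slide; the function names are hypothetical.

```python
def lease_renewal_interval(lease_timeout_s=20.0):
    """The lease is renewed every 1/4 of the lease timeout."""
    return lease_timeout_s / 4

def health_check_interval(health_check_timeout_s=30.0):
    """sp_server_diagnostics reports every 1/3 of the health-check timeout."""
    return health_check_timeout_s / 3

print(lease_renewal_interval())      # 5.0  (default 20-second lease timeout)
print(health_check_interval())       # 10.0 (default 30-second health check timeout)
print(health_check_interval(60.0))   # 20.0 (if the timeout were raised to 60 seconds)
```

Raising either timeout therefore also stretches the interval at which the lease is renewed or health results arrive.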

sp_server_diagnostics
Flexible Failover Policy (failure condition levels)
5: Failover or restart on any qualified failure condition. Query processing errors (worker thread exhaustion or scheduler deadlocks).
4: Failover or restart on moderate SQL Server errors. Resource errors (out of memory).
3: Failover or restart on critical SQL Server errors. System errors (orphaned spinlocks, access violation, SQL internal errors).
2: Failover or restart on server unresponsive. sp_server_diagnostics failure or timeout.
1: Failover or restart on SQL service failure. Service down (SQL service is down, lease).
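Because each level includes all lower levels, the policy reduces to a simple comparison. The sketch below is an illustrative model only, not how the resource DLL is implemented; the dictionary text paraphrases the slide and the function name is hypothetical.

```python
# Condition descriptions paraphrased from the slide, keyed by level.
FAILURE_CONDITIONS = {
    1: "SQL service down or lease expired",
    2: "server unresponsive (sp_server_diagnostics failure or timeout)",
    3: "critical SQL Server errors (system errors)",
    4: "moderate SQL Server errors (resource errors such as out of memory)",
    5: "any qualified failure condition (query processing errors)",
}

def triggers_failover(detected_level, configured_level=3):
    """Return True if a condition of detected_level is covered by the
    configured failure condition level (a level includes all lower ones)."""
    for level in (detected_level, configured_level):
        if level not in FAILURE_CONDITIONS:
            raise ValueError(f"level must be 1-5, got {level}")
    return detected_level <= configured_level

print(triggers_failover(2))      # True: level 2 is included in the default of 3
print(triggers_failover(4))      # False: level 4 is above the default of 3
print(triggers_failover(4, 5))   # True: level 5 includes everything
```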

sp_server_diagnostics results
Starting with SQL 2012, instead of using a SELECT query, Windows Cluster RHS.exe uses the sp_server_diagnostics system stored procedure to capture diagnostic data and decide on a potential failover.
The stored procedure exists in all SQL Server editions, on both standalone and clustered instances.
It can detect SQL Server internal errors such as worker thread exhaustion, a persistent OOM condition in the internal resource pool, and orphaned spinlocks.
HADRRES.dll within RHS.exe establishes a local ODBC connection to the SQL Server instance using the credentials of the cluster service account (the default is the local system account, NT Authority\SYSTEM).
It invokes sp_server_diagnostics just one time; the procedure then runs in a loop and reports every 1/3 of the health check timeout, meaning every 10 seconds by default.

Why Automatic Failover did not occur?
One or more databases in the AG are not in the synchronized state
AG is set to manual failover, or the replica is asynchronous
Failover threshold exceeded
WSFC cannot connect to SQL Server
Secondary not connected

Troubleshooting Automatic Failover in
TO SIMULATE UNSUCCESSFUL AUTOMATIC FAILOVER
1. Create an availability group with two databases.
2. Suspend synchronization of one of the availability databases on the secondary SQL Server that is part of the automatic failover pair.
3. Make changes to a table in each availability database on the SQL Server hosting the primary replica.
4. Shut down the SQL Server hosting the primary.
5. Connect to each of the secondary replicas and run a query to check state information, latest LSN, and commit times.

Why Automatic Failover did not occur?
Maximum Failover Threshold
The maximum failover threshold is the number of nodes in the cluster minus one (n-1).
In a two-node cluster, a cluster resource (e.g. an AG) can automatically fail over 1 time every 6 hours.
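The threshold rule can be modeled as a count of recent failovers. This is a rough sketch under two assumptions not stated in the deck: that the 6-hour period behaves like a sliding window, and that the n-1 default has not been changed. All names are hypothetical.

```python
from datetime import datetime, timedelta

FAILOVER_PERIOD = timedelta(hours=6)  # period quoted on the slide

def max_failovers_in_period(node_count):
    """Default threshold from the slide: number of nodes minus one."""
    return max(node_count - 1, 0)

def can_fail_over_automatically(node_count, previous_failovers, now):
    """True if another automatic failover is still within the threshold.

    previous_failovers: datetimes of earlier failovers of this resource.
    Assumption: failovers older than the period no longer count.
    """
    recent = [t for t in previous_failovers
              if timedelta(0) <= now - t < FAILOVER_PERIOD]
    return len(recent) < max_failovers_in_period(node_count)

now = datetime(2019, 4, 25, 12, 0)
print(can_fail_over_automatically(2, [now - timedelta(hours=1)], now))  # False
print(can_fail_over_automatically(2, [now - timedelta(hours=7)], now))  # True
```

In the first call the two-node cluster already used its single allowed failover an hour ago, which is the situation where the AG stays down instead of failing back.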

Availability Groups: Troubleshooting Configuration and Setup

Error 35250 Failed to join database to the availability group
Step 1: Make sure the database mirroring endpoint was created and is started.

    SELECT tep.name AS EndPointName, sp.name AS CreatedBy, tep.type_desc, tep.state_desc, tep.port
    FROM sys.tcp_endpoints tep
    INNER JOIN sys.server_principals sp ON tep.principal_id = sp.principal_id
    WHERE tep.type = 4

Step 2: Make sure the firewall is NOT blocking the inbound port on which the AlwaysOn mirroring endpoint is listening.
By default, when you create an AG, it uses the database mirroring endpoint on port 5022. Make sure a firewall (or sometimes antivirus software) is not blocking that port.

a) Is the endpoint listening? Test with "Telnet name EndPointPort" or "Telnet IP EndPointPort". For example, if the endpoint is defined on Server1 on port 5022, try Telnet Server1 5022 and/or Telnet with the server's IP address and port 5022.
If the endpoint is listening, you should get a blank screen. If not, Telnet returns an error when trying to connect.
If it works by IP but not by NAME, there could be a DNS or name resolution error of some sort.
If it works by NAME and NOT by IP, there could be more than one endpoint on that server (another SQL instance?) listening on that port. EVEN THOUGH the endpoint on the instance in question shows "STARTED", another instance may actually hold the binding and prevent the instance in question from listening and establishing TCP connections.
If it doesn't connect at all with Telnet, look for firewall and/or antivirus products such as McAfee HIPS or Norton. This is "standard" SQL connection troubleshooting by elimination, but for the endpoint port in question.
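Where Telnet is not installed, the same reachability test can be approximated with a plain TCP connect. The following is a minimal Python sketch of that idea, not a tool from the original deck; the function name is hypothetical, and it only shows that something accepts connections on the port, not that it is the right SQL Server instance.

```python
import socket

def endpoint_is_listening(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be established.

    Mirrors the Telnet test: success means something is accepting
    connections on that port; failure covers refused connections,
    timeouts (e.g. a firewall dropping packets), and name resolution errors.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

For the example on this slide the call would be endpoint_is_listening("Server1", 5022), repeated with the server's IP address in place of the name to separate DNS problems from port problems.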

Step 3: Check whether the SQL Server service startup account has NOT been given CONNECT permission on the endpoint.
Make sure a domain account is used as the SQL Server service startup account and that it has CONNECT permission on the endpoint on all the nodes participating in the availability group. Run the following query to get the list of accounts that have CONNECT permission on the endpoint on the server(s) in question.

    SELECT perm.class_desc, prin.name, perm.permission_name, perm.state_desc,
           prin.type_desc AS PrincipalType, prin.is_disabled
    FROM sys.server_permissions perm
    LEFT JOIN sys.server_principals prin ON perm.grantee_principal_id = prin.principal_id
    LEFT JOIN sys.tcp_endpoints tep ON perm.major_id = tep.endpoint_id
    WHERE perm.class_desc = 'ENDPOINT'
      AND perm.permission_name = 'CONNECT'
      AND tep.type = 4

Availability Group – Listener Creation Fails
Possible causes:
The Cluster Name Object (CNO) does not have the "Create Computer object" permission on the computer container in Active Directory Users and Computers.
An Active Directory or Windows policy prevents creation of new computer objects.
The IP address cannot be registered in DNS because of a duplicate or invalid IP address for the listener.
A support KB explains a few of these scenarios.

Where to look:
Failover cluster logs: you can use the Get-ClusterLog PowerShell cmdlet.
Windows System event logs: open Windows Event Viewer and, under the System event log, filter by source "FailoverClustering".

Example: the CNO was in a different OU, so the issue was permission-related in Active Directory.

Connection Timeout in Multi-subnet Availability Group
Cause: Not using the MultiSubnetFailover connection string attribute in a multi-subnet environment.
The availability group listener has an IP address in each subnet.
ALL listener IP addresses are registered in DNS (the default is RegisterAllProvidersIP=1); however, only one IP address is online at a time.
Without the MultiSubnetFailover parameter, the client driver tries to connect sequentially to all IP addresses for the listener. Sequential connections may cause a long logon time or logon time-outs.
If you use MultiSubnetFailover=True in the connection string, the client tries to connect to all listener IP addresses in parallel.
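The difference between sequential and parallel attempts can be estimated with a toy model. This is a simplified sketch, not the driver's actual algorithm; the 21-second per-attempt timeout and the 50 ms connect time used in the example are assumptions chosen for illustration, and all names are hypothetical.

```python
def time_to_connect_ms(ip_online_flags, attempt_timeout_ms, connect_ms, parallel):
    """Estimate milliseconds until the first successful connection.

    ip_online_flags: booleans in the order DNS returned the listener IPs,
                     True where that IP is currently online.
    parallel:        True models MultiSubnetFailover=True.
    Returns None when no IP is online (the logon would time out).
    """
    if True not in ip_online_flags:
        return None
    if parallel:
        # All IPs are tried at once; the online one answers first.
        return connect_ms
    # Sequential: every offline IP ahead of the online one must time out first.
    offline_before = ip_online_flags.index(True)
    return offline_before * attempt_timeout_ms + connect_ms

# Two-subnet listener where DNS happens to return the offline IP first.
flags = [False, True]
print(time_to_connect_ms(flags, 21000, 50, parallel=False))  # 21050
print(time_to_connect_ms(flags, 21000, 50, parallel=True))   # 50
```

With a typical 15-second login timeout, the sequential case in this example would fail before the online address is even tried.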

What if, in a multi-subnet listener scenario, the client cannot support the "MultiSubnetFailover=True" attribute?
Change RegisterAllProvidersIP to 0
Reduce HostRecordTTL
(Import-Module FailoverClusters is not needed in 2012 R2.)
First run Get-ClusterResource to find the AG listener resource name, then:

    Get-ClusterResource ContosoRetailAG_ContosoAGListen | Set-ClusterParameter -Name RegisterAllProvidersIP -Value 0
    Get-ClusterResource ContosoRetailAG_ContosoAGListen | Set-ClusterParameter -Name HostRecordTTL -Value 120

Multi-subnet AG: TransparentNetworkIPResolution
Available starting with .NET 4.6.1, and also part of Microsoft ODBC Driver 13.1.
When a DNS name returns multiple IP addresses, an initial connection attempt is made to the first-returned IP address, but that attempt is timed out after only 500 ms, and then connection attempts to all the IP addresses are made in parallel.
By default, the TransparentNetworkIPResolution property is set to true.
See: TransparentNetworkIPResolution in SQLClient for .NET 4.6.1
It is still not perfect, as can be seen in this article: Connection timeout issue with .NET Framework – TransparentNetworkIPResolution
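The 500 ms behavior described above can be approximated with a small timing model. This is a deliberately simplified sketch (it ignores the overall login timeout and assumes the first attempt is abandoned at exactly 500 ms); it is not the SqlClient implementation, and the names are hypothetical.

```python
def tnir_time_to_connect_ms(ip_online_flags, connect_ms, first_attempt_ms=500):
    """Estimate milliseconds until connection with TransparentNetworkIPResolution.

    ip_online_flags: booleans in DNS-returned order, True where the IP is online.
    connect_ms:      time a connection to the online IP needs to complete.
    Returns None when no IP is online.
    """
    if True not in ip_online_flags:
        return None
    if ip_online_flags[0] and connect_ms <= first_attempt_ms:
        # The first-returned IP is online and answers inside the 500 ms window.
        return connect_ms
    # Otherwise the first attempt is abandoned after 500 ms and
    # all IPs are then tried in parallel.
    return first_attempt_ms + connect_ms

print(tnir_time_to_connect_ms([True, False], 50))    # 50
print(tnir_time_to_connect_ms([False, True], 50))    # 550
print(tnir_time_to_connect_ms([True, False], 800))   # 1300
```

The last call hints at why the feature is "still not perfect": a slow but healthy first connection is cut off at 500 ms and has to start over.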

Availability Group Related Wait Types

Troubleshooting : Wait Types
HADR_SYNC_COMMIT
Indicates a delay because a transaction cannot commit on the primary while it waits to be hardened on the synchronous secondary replica.

WRITELOG
A wait type with no direct relevance to AlwaysOn. It typically indicates the time it takes to harden a log block to the transaction log. Investigation is needed, as it mostly indicates a storage subsystem bottleneck.
The relationship between WRITELOG and HADR_SYNC_COMMIT is explained very well in a blog post.

HADR_SYNCHRONIZING_THROTTLE
Indicates the time it takes a synchronizing secondary database to catch up with the primary database in order to transition from the synchronizing state to the synchronized state. This is an expected wait when a secondary database is trying to catch up.
If the primary generates log faster than the secondary can catch up, a throttling message is sent to the primary. The primary then puts in an intentional delay, initially 2 ms and increasing (4, 6, 8 ms) up to a maximum of 20 ms, so that the secondary can catch up and become synchronized.
Example: you patched a synchronous secondary replica and it came back after a few minutes; heavy transactions on the primary may cause the secondary to fall behind or stay behind.
If you are having latency issues and you see this wait type at the top, consider switching the secondary replica to asynchronous commit. Later, during off-peak hours, when the estimated data loss on that secondary nears zero, you can switch back to synchronous-commit mode and it will quickly change its status to synchronized.
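A wait-statistics snapshot can be triaged by setting aside the waits this deck says are usually ignored and ranking the rest. The sketch below is illustrative only; the ignore list and notes are taken from this deck's wait type slides, while the function name and the sample numbers are hypothetical.

```python
# Waits the deck lists as "usually ignored".
USUALLY_IGNORED = {
    "HADR_LOGCAPTURE_WAIT",
    "HADR_WORK_QUEUE",
    "REDO_THREAD_PENDING_WORK",
    "HADR_CLUSAPI_CALL",
}

# Short notes paraphrasing the slide text for the waits worth a look.
NOTES = {
    "HADR_SYNC_COMMIT": "commit waiting for hardening on the synchronous secondary",
    "WRITELOG": "local log hardening; often a storage subsystem bottleneck",
    "HADR_SYNCHRONIZING_THROTTLE": "primary delaying so a synchronizing secondary can catch up",
}

def triage_waits(wait_ms_by_type):
    """Drop usually-ignored waits and rank the rest by wait time, highest first."""
    kept = {w: ms for w, ms in wait_ms_by_type.items() if w not in USUALLY_IGNORED}
    ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
    return [(w, ms, NOTES.get(w, "no note in this deck")) for w, ms in ranked]

# Hypothetical snapshot, e.g. copied from a wait statistics query.
snapshot = {
    "HADR_CLUSAPI_CALL": 900000,
    "HADR_SYNC_COMMIT": 120000,
    "WRITELOG": 45000,
    "HADR_WORK_QUEUE": 700000,
}
for wait_type, ms, note in triage_waits(snapshot):
    print(wait_type, ms, note)
```

Even though HADR_CLUSAPI_CALL tops the raw numbers in this sample, the triaged list starts with HADR_SYNC_COMMIT, which is the wait that actually calls for investigation.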

HADR waits and synchronization
[Diagram: transactions arrive on the Primary; log blocks flow from the Primary to the Secondary ("We PULL log blocks")]

Synchronization state and health:
SYNCHRONIZED = HEALTHY
SYNCHRONIZING = PARTIALLY_HEALTHY
NOT SYNCHRONIZING = NOT_HEALTHY

WRITELOG and HADR_SYNC_COMMIT
Log blocks are sent to the secondary and written to the local log asynchronously. The time spent waiting for the local log is WRITELOG. The time spent waiting for the secondary is HADR_SYNC_COMMIT. Read the blog post on WRITELOG, HADR_SYNC_COMMIT, and HADR_SYNCHRONIZING_THROTTLE.

Wait types that can usually be ignored:
HADR_LOGCAPTURE_WAIT
HADR_WORK_QUEUE
REDO_THREAD_PENDING_WORK (secondary)
HADR_CLUSAPI_CALL

HADR_CLUSAPI_CALL is very frequently seen when reviewing wait statistics in an availability group environment. There is nothing to worry about even if it is at the top of your wait statistics list. All this wait type tells us is that a SQL Server thread is waiting to switch from non-preemptive mode (scheduled by SQL Server) to preemptive mode (scheduled by Windows Server) to invoke the WSFC APIs. Availability groups work very closely with the WSFC, so many cluster APIs and activities are executed, and it is natural to see this wait at the top of the list.

HADR_LOGCAPTURE_WAIT indicates the time SQL Server is waiting for log records to become available. If hardening is completely caught up and no log blocks are waiting to be hardened to the transaction log on disk, SQL Server waits to get the next log block. So if the log scan is completely caught up, you will actually see a high value for this wait type. This is expected and does not necessarily mean that something is wrong.

Network Latency is causing slowness
Here are the counters that need to be collected when the log send queue is observed to grow.
On the primary:
SQLServer:Databases > Log Bytes Flushed/sec
SQLServer:Availability Replica > Bytes Sent to Replica/sec

Availability Groups: If failover is taking longer
The cluster part of the failover is usually very quick. Most of the time is spent in database "recovery".
The original primary had one or more long-running transactions that needed to be rolled back after the failover, increasing recovery time.
REDO on the secondary was behind, so after failover it takes some time before REDO can catch up and recovery completes.
High number of VLFs.
The original primary had a slow checkpoint, resulting in increased recovery time after failover.

Before failover – ensure Redo and hardened LSN

Shared Redo Target for Replicas
If REDO is behind on one of the synchronous replicas, it will slow down the other replicas.
Trace Flag: when enabled on the secondary, it ignores the redo target provided in the progress message from the primary and always sets the redo target at the Max LSN value.

SCENARIO: Quorum is lost – all nodes intact
Communication problems between nodes
All nodes are intact
Recover original primary

Handling Quorum Loss Recover Primary
On the original primary:
Force quorum: start the node hosting the original primary with /ForceQuorum
Force failover of the availability group
No data loss if run on the primary

In Virtualized Environment if you find a high number of discarded packets
VMWARE KB

Reference: Here are the counters that need to be collected when log send queue is observed to grow. Troubleshooting AlwaysOn:
