Presentation is loading. Please wait.

Presentation is loading. Please wait.

Troubleshooting Availability Group Failovers

Similar presentations


Presentation on theme: "Troubleshooting Availability Group Failovers"— Presentation transcript:

1 Troubleshooting Availability Group Failovers
Sourabh Agarwal Sr. Program Manager Microsoft Data Platform Group Sourabh Agarwal Sr. Program Manager Microsoft Data Platform Group

2 MICROSOFT CONFIDENTIAL
C:\Users>WHOAMI Sourabh Agarwal works as a Program Manager for the Microsoft Data Platform Group and owns the HADR components for SQL Server In-Market release. In his decade long stint at Microsoft, he has worked in different capacities and specializes in providing consulting services on SQL Server and other Data Platform technologies to Microsoft customers across business domains and geographies. His specialization includes Designing and Optimizing SQL Deployments, HADR, Microsoft Azure, and PowerShell Scripting. Twitter Blog – aka.ms/sql_server_team Team Twitter HADR in SQL Server is a suite of technologies used by customers to achieve 5 9’s of availability. First part of the session will discuss how to position it against our competition (e.g. Oracle RAC). Second part will go over the new scenarios (including diagnostics/troubleshooting) available in SQL Server To round up the discussion we describe talk about some of the new scenarios/capabilities being introduced in SQL Server V-Next for Always On and Transaction replication. Speaker: Learning Series MICROSOFT CONFIDENTIAL

3 Agenda Understanding Lease Renewal and Health Detection
11/18/ :22 AM Agenda Understanding Lease Renewal and Health Detection Always On Failover Tools or Logs for troubleshooting © 2014 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

4 Always On Lease Renewal Process
11/18/ :22 AM Always On Lease Renewal Process Every ¼ of the Lease Timeout Setting Simple handshake process between SQL and Resource DLL Runs outside of SQLOS and at a higher priority Enabled at Failover_Condition_Level 1 Resource DLL Lease Thread Server Lease Worker/Thread SetEvent(Client Event) WaitForSingleObject(Client Event) WaitForSingleObject(Server Event) SetEvent(Server Event) Wait for 1/4 the Lease Timeout Repeat Loop Until Shutdown or Lease Expires 3 MINS © 2014 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

5 Always On Health Check Initiated by the WSFC Resource DLL running in RHS.exe Executes sp_server_diagnostics against the SQL Server Sp_server_diagnostics runs every 1/3rd of Health Check Timeout setting Two scenarios under could lead to failover Failure to return sp_server_diagnostics within the Health Check Timeout internal (default 30 secs) Output of sp_server_diagnostics returning an error – Failover Condition level 3-5 3 Mins

6 Sp_Server_Diagnostics
11/18/ :22 AM Sp_Server_Diagnostics Repeat interval of 1/3rd of HealthCheckTimeout setting Persistent connection to SQL Server Preemptive thread running on high Priority Checks for errors in multiple components System Resource query_processing io_subsystem Events Always On Availability Group Ensure to talk about the Gap in the timelines. System – Collects data from a system perspective on spinlocks, severe processing conditions, non-yielding tasks, page faults, and CPU usage. Resource – Collects data from a resource perspective on physical and virtual memory, buffer pools, pages, cache and other memory objects. Query Processing – Collects data from a query-processing perspective on the worker threads, tasks, wait types, CPU intensive sessions, and blocking tasks. IO Subsystem – Collects data on IO. In addition to diagnostic data. Events – Collects data and surfaces through the stored procedure on the errors and events of interest recorded by the server, including details about ring buffer exceptions, ring buffer events about memory broker, out of memory, scheduler monitor, buffer pool, spinlocks, security, and connectivity . The resource DLL calls the sp_server_diagnostics stored procedure and sets the repeat interval to one-third of the HealthCheckTimeout setting. If the sp_server_diagnostics stored procedure is slow or is not returning information, the resource DLL will wait for the interval specified by HealthCheckTimeout before it reports to the WSFC service that the SQL instance is unresponsive. If the dedicated connection is lost, the resource DLL will retry the connection to the SQL instance for the interval specified by HealthCheckTimeout before it reports to the WSFC service that the SQL instance is unresponsive. © 2014 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

7 Always On Failover Failovers can be divided into two categories Automatic Failover configured for AG, but failover was unsuccessful Databases in AG are not synchronized Cluster configuration issues Insufficient permissions for NT Authority/System Unexpected Automatic failover SQL Server Always On detects health issue Lease Timeout (System Wide Issues) Health Check Timeout Database Level Health Check (if enabled) Windows Cluster detects health issue (e.g. Loss of Cluster Quorum) 5 Mins

8 Troubleshooting Tools
Always On Health Session SQL Server Error Logs Cluster Logs Failover Cluster Diagnostics Logs System Event Logs

9 Always On Health Session
11/18/ :22 AM ALTER SERVER CONFIGURATION SET DIAGNOSTICS LOG ON; Always On Health Session XEvents session for Always On Health 4 files of max 5 MB each Enabled by default Events Captured availability_replica_manager_state_change availability_replica_state error_reported availability_replica_state_change Lock_redo_blocked Avaliability_group_lease_expired Availability_replica_automatic_failover_validation hadr_db_partner_set_sync_state © 2014 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

10 Failover Cluster Diagnostics Logs
11/18/ :22 AM Failover Cluster Diagnostics Logs Information captured by sp_server_diagnostics on Primary replica 10 files of max 100 MB each Events Captured availability_group_state_change info_message availability_group_is_alive_failure component_health_result ALTER SERVER CONFIGURATION SET DIAGNOSTICS LOG ON ALTER SERVER CONFIGURATION SET DIAGNOSTICS LOG OFF © 2014 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

11 Lease Timeout Improvements
11/18/ :22 AM Lease Timeout Improvements New descriptive Error messages for Lease Timeout New Lease States reported by the Extended Events availability_group_lease_expired hadr_ag_lease_renewal System Performance objects in Cluster Logs        // ErrorNumber: 19419        // ErrorSeverity: EX_USER        // ErrorInserts: Availability group name        // ErrorFormat: The renewal of the lease between availability group '%.*ls' and the Windows Server Failover Cluster failed because the existing lease is no longer valid.        // ErrorCause: The lease worker on the sql server side did not get scheduled on time to process event signal from the cluster.        // ErrorCorrectiveAction: Check the CPU utilization on the server as sql server lease worker seems to be starving.        // ErrorOwner: fasedrak        // ErrorFirstProduct: SQL11        const int HADR_AG_LEASE_EXPIRED_WAITING_FOR_RENEW = 19;        // ErrorNumber: 19420        // ErrorFormat: The availability group '%.*ls' is explicitly asked to stop the lease renewal.        // ErrorCause: The lease renewal is stopping as a part of bringing the availability group offline.        // ErrorCorrectiveAction:        const int HADR_AG_LEASE_STOPPED = 20;        // ErrorNumber: 19421        // ErrorFormat: The renewal of the lease between availability group '%.*ls' and the Windows Server Failover Cluster failed because renewal didn't happen within lease interval.        // ErrorCause: The lease helper on the cluster side did not signal the sql server lease worker on time.        // ErrorCorrectiveAction: Check corresponding availability group resource in WSFC cluster to see if it reported any error.        const int HADR_AG_LEASE_RENEWAL_TIMEOUT = 21;        // ErrorNumber: 19422        // ErrorFormat: The renewal of the lease between availability group '%.*ls' and the Windows Server Failover Cluster failed because of a windows error with Error code ('%d').        // ErrorCause: The lease worker on sql server side failed to renew the lease because of a windows error.        // ErrorCorrectiveAction: Check windows error code and take the corrective action.        const int HADR_AG_LEASE_RENEWAL_FAILED_DUE_WINDOWS_ERROR = 22; 43#define HADR_LEASE_WORKER_STATES_LIST(ACTIONS) \ 44 ACTIONS (LeaseWorkerStarted, 0), \ 45 ACTIONS (StartedExcessLeaseSleep, 1), \ 46 ACTIONS (FailedExcessSleepInvalidOnlineLease, 2), \ 47 ACTIONS (SkipExcessSleep, 3), \ 48 ACTIONS (ExcessSleepSucceeded, 4), \ 49 ACTIONS (RenewSucceeded, 5), \ 50 ACTIONS (LeaseNotValid, 6), \ 51 ACTIONS (StopLeaseRenewal, 7), \ 52 ACTIONS (LeaseExpired, 8), \ 53 ACTIONS (FailedWithWindowsError, 9) © 2014 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

12 Unsuccessful Failover
Demo XE output for Lease Timeout Cluster Log Output -- Search for Phrase “Lease Timeout Detected” SQL Error Log output

13 Troubleshooting Lease Timeout
Demo XE output for Lease Timeout Cluster Log Output -- Search for Phrase “Lease Timeout Detected” SQL Error Log output

14 Always On Unexpected Failover
Demo

15 Failover Analysis Utility
Demo

16 Bookmarks SQL Server Tiger Team Blog http://aka.ms/sqlserverteam
Tiger Toolbox GitHub SQL Server Release Blog BP Check SQL Server Standards Support Trace Flags SQL Server Support lifecycle SQL Server Updates Twitter @mssqltiger 1 Mins © 2014 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

17 11/18/ :22 AM © 2014 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.


Download ppt "Troubleshooting Availability Group Failovers"

Similar presentations


Ads by Google