unexplained AG failover

Finding root cause for unexplained AG Failover Trayce Jordan, Sr. Support Escalation Engineer, Microsoft October 3, 2015.
Presentation transcript:

Finding root cause for unexplained AG failover
Trayce Jordan, MCM, MCA, MCITP, MCTS, MCDBA, MCSD, CISSP
Senior Premier Field Engineer - SQL, Microsoft Corporation
Trayce.Jordan@Microsoft.com | Trayce@SeekWellAndProsper.com | @SeekWellDBA
http://seekwellandprosper.com

Do these quotes sound familiar?
- “My AG just failed over – why?”
- “My AG didn’t failover – why not?”
- “I know where to look, but it doesn’t make any sense!”
- “I don’t know how to figure it out!”

Our Agenda
- Look at logs!
- Discuss most common issues for failover.
- Review the SQL/Cluster components.
- Share my root cause analysis (RCA) approach.
- Look at logs!

Most common causes for failover
- Quorum loss
- Lease timeout
- HealthCheck timeout
- SQL dumps
- User initiated

QUORUM LOSS
===========
Database mirroring had a witness. AG “replaced” the witness with WSFC, which is good at arbitration. Because of this, SQL depends on quorum. Quorum is in the eye of the beholder.

LEASE TIMEOUT
=============
SQL sets up a lease between the AG primary and the “cluster”. If that lease is broken, the AG must go down.

HEALTH CHECK TIMEOUT
====================
We replaced health checks of @@SERVERNAME with sp_server_diagnostics. If that “fails” or times out, we’ll shut down or fail over.

SQL DUMPS
=========
During a memory dump, we freeze all of our threads. The “lease renewal thread” will fail to respond and cause a lease timeout.

USER INITIATED
==============
Either from the WSFC Cluster Manager or from SQL Server.

Most common causes for not failing over
- One or more DBs not sync’d
- Secondary not connected
- WSFC cannot connect to SQL
- AG set for manual failover
- Exceeded failover thresholds

DATABASES NOT SYNC’D
====================
Databases must be SYNCHRONIZED. Some timing-based conditions can leave the databases not sync’d. The product group (PG) is working to eliminate that condition; it is rare, though.

SECONDARY NOT CONNECTED
=======================
If the secondary is not connected to the primary prior to failover, we will not fail over to it.

WSFC CANNOT CONNECT TO SQL
==========================
In order to bring up the AG, the cluster must connect to SQL and have permissions to bring the AG online.

AG IS SET FOR MANUAL FAILOVER
=============================
If you try to fail over through WSFC while the replica is set for manual failover, it won’t fail over, because we manage the possible owners. A manual-failover secondary will not be set as a possible owner.

EXCEEDED FAILOVER THRESHOLDS
============================
Cluster settings for failover attempts, failure threshold window, and retry attempts can all affect failover.
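The blockers listed above can be read as a checklist to walk through when an AG did not fail over. A minimal sketch, assuming a hypothetical `Replica` record; the field names are illustrative only and are not SQL Server DMV columns or WSFC properties:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    # Hypothetical fields that mirror the slide notes; not a real API.
    all_dbs_synchronized: bool     # every AG database is SYNCHRONIZED
    connected_to_primary: bool     # secondary was connected before the failure
    cluster_can_connect: bool      # WSFC can connect to SQL and bring the AG online
    automatic_failover: bool       # replica is configured for automatic failover
    failovers_in_window: int       # failovers already counted in the threshold window
    max_failovers_in_window: int   # cluster failover threshold for that window


def failover_blockers(r: Replica) -> List[str]:
    """Return every reason this secondary would not take over (empty if none)."""
    blockers = []
    if not r.all_dbs_synchronized:
        blockers.append("one or more databases not synchronized")
    if not r.connected_to_primary:
        blockers.append("secondary not connected")
    if not r.cluster_can_connect:
        blockers.append("WSFC cannot connect to SQL")
    if not r.automatic_failover:
        blockers.append("replica set for manual failover")
    if r.failovers_in_window >= r.max_failovers_in_window:
        blockers.append("failover threshold exceeded")
    return blockers
```

Returning all blockers rather than stopping at the first one matches how an RCA usually goes: more than one condition can be true at the time of the failure.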

SQL/Cluster architecture & interactions
- AlwaysOn AGs require and depend on WSFC. (The Linux version in SQL v-next will be different.)
- The RHS.EXE process monitors SQL health.
- The RHS.EXE process maintains a “lease” with SQL Server on the AG primary.
- If the cluster service stops on the AG primary, the AG goes offline.

A separate RHS.EXE process can be used for each resource in a cluster. We recommend a separate RHS process for each availability group if you have many AGs or many databases in AGs. The lease process is explained in further detail on slide 10.
Reference: http://blogs.msdn.com/b/psssql/archive/2012/09/07/how-it-works-sql-server-alwayson-lease-timeout.aspx

The Resource Control Manager
- RCM is the thread within the Cluster Service responsible for resources.
- RHS.EXE is a separate process in charge of testing:
  - LooksAlive every 5 seconds
  - IsAlive every 60 seconds

RHS Interacts with SQL
[Diagram labels: SQL Server 2012/2014/2016, Resource DLL, sp_server_diagnostics, Diagnostics, SQL Server]

When the Resource Control Manager starts the AG resource, the SQL AG resource DLL is loaded into RHS.EXE. It then makes an ODBC connection to SQL Server and initiates a call to sp_server_diagnostics. It is set to receive data every 1/3 of the “health check timeout” interval.

The RHS.EXE process (through the resource DLL) makes a persistent connection for each Availability Group. It does not normally disconnect and reconnect, though a new connection may be established if SQL Server kills its session or some other issue takes place.

The sp_server_diagnostics stored procedure is not actually called “multiple times”. It is called once, when the connection is made, and is passed a parameter that specifies how often data should be returned to the RHS process for evaluation.
Reference: https://msdn.microsoft.com/en-us/library/hh710040.aspx
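The “every 1/3 of the health check timeout” rule is simple arithmetic. A minimal sketch, assuming the documented default HEALTH_CHECK_TIMEOUT of 30,000 ms; the function names are hypothetical and the timeout test is a simplification of whatever the resource DLL really does:

```python
DEFAULT_HEALTH_CHECK_TIMEOUT_MS = 30_000  # assumed default; check your AG settings


def diagnostics_interval_ms(health_check_timeout_ms: int) -> int:
    """Interval at which sp_server_diagnostics is asked to return data."""
    return health_check_timeout_ms // 3


def health_check_timed_out(ms_since_last_result: int,
                           health_check_timeout_ms: int) -> bool:
    """Simplified model: no data for a full timeout period counts as a timeout."""
    return ms_since_last_result >= health_check_timeout_ms
```

With the assumed default, RHS expects a result roughly every 10 seconds, so a timeout implies about three consecutive reporting intervals passed with no data.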

sp_server_diagnostics Flexible Failure Conditions
5: Failover or restart on any qualified failure conditions (query processing errors)
4: Failover or restart on moderate SQL Server errors (resource errors, OOM)
3: Failover or restart on critical SQL Server errors (system errors)
2: Failover or restart on server unresponsive (sp_server_diagnostics failure or timeout)
1: Failover or restart on SQL service failure (service down)

All levels are cumulative: level 2 includes all level 1 checks, and level 4 includes levels 3, 2, and 1 as well.

Level 1: Just checks whether the SQL Server service is running. If so, it returns true for the “IsAlive” check. In SQL Server 2008 R2 and earlier, the “LooksAlive” check only tested whether the service was running; if it was, the “IsAlive” check then tested whether SQL could return data by sending the query SELECT @@SERVERNAME. Beginning with SQL Server 2012, “LooksAlive” and “IsAlive” both return TRUE if the SQL service is running when the flexible failure condition is set at level 1. A true “IsAlive” check is not done until level 2.

Level 2: Fails if sp_server_diagnostics doesn’t return anything or if the returned data is corrupted somehow. It does not “check the health” of what is returned; it only asks “did I get data back?” Essentially this is equivalent to the old “IsAlive” check for SQL FCIs (2008 R2 and earlier): SELECT @@SERVERNAME.

Level 3, “System errors”:
- Too much dumping (>= 100 dumps, and the more recent interval check shows >= 2 more dumps)
- Memory scribbler present (buffer overrun)
- Orphaned spinlock

Level 4, “Resource errors”: Essentially OOM, meaning we haven’t freed any memory in more than 2 minutes.

Level 5, “Query processing errors”:
- Unresolved deadlocks
- Deadlocked schedulers
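Because the levels are cumulative, deciding whether a condition triggers a failover or restart reduces to asking whether the condition belongs to the configured level or any level below it. A minimal sketch; the condition names are hypothetical labels for the categories in the notes above, not values returned by sp_server_diagnostics:

```python
# Condition category that each level adds on top of the levels below it.
# The string labels are illustrative, not SQL Server identifiers.
LEVEL_ADDS = {
    1: "service_down",
    2: "diagnostics_failure_or_timeout",
    3: "system_error",
    4: "resource_error",
    5: "query_processing_error",
}


def triggers_failover(failure_condition_level: int, condition: str) -> bool:
    """True if `condition` is covered at the given (cumulative) level."""
    if failure_condition_level not in LEVEL_ADDS:
        raise ValueError("failure condition level must be between 1 and 5")
    covered = {LEVEL_ADDS[lvl] for lvl in range(1, failure_condition_level + 1)}
    return condition in covered
```

For example, at level 3 a system error such as an orphaned spinlock is covered, while an out-of-memory resource error is not; it only becomes a trigger at level 4 or 5.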

Two-way “Handshake lease”
- Both the RHS and SQL Server must respond to each other.
- When one updates the shared memory object, it triggers the other to ‘respond’.

The default lease timeout is 20 seconds. Every quarter of the lease timeout setting, one process should update the shared memory, triggering the other to respond. If the other side responds, the lease timeout is reset and the process starts over. So when things are working properly, by default, a new 20-second countdown timer is started every 5 seconds.
Reference: http://blogs.msdn.com/b/psssql/archive/2012/09/07/how-it-works-sql-server-alwayson-lease-timeout.aspx
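The lease timing described above can be modeled as a countdown that is restarted on every successful renewal. A simplified simulation, assuming one renewal attempt per quarter of the lease timeout and treating a renewal that lands exactly on the deadline as too late; `simulate_lease` is a hypothetical helper, not how RHS or SQL Server implement the lease:

```python
from typing import Optional, Sequence


def simulate_lease(renewal_ok: Sequence[bool],
                   lease_timeout: float = 20.0) -> Optional[float]:
    """Return the time (seconds) at which the lease expires, or None.

    renewal_ok holds one flag per renewal attempt; attempts happen every
    lease_timeout / 4 seconds, starting one interval after time 0.
    """
    interval = lease_timeout / 4
    expires_at = lease_timeout      # first countdown starts at time 0
    now = 0.0
    for ok in renewal_ok:
        now += interval
        if now >= expires_at:       # countdown ran out before this attempt
            return expires_at
        if ok:
            expires_at = now + lease_timeout   # handshake succeeded: new countdown
    return None
```

In this model a memory dump that freezes the lease renewal thread shows up as a run of failed attempts: `simulate_lease([True, False, False, False, False])` reports expiry at 25 seconds, 20 seconds after the last good renewal, while two missed attempts followed by a success keep the lease alive.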

Review AlwaysOn Health *.XEL files
- Look for failover DDL events
- Look for lease timeout events

Review AlwaysOn Health *.XEL files
- Look at all state changes to get timelines

Correlate to SQL & Cluster Logs
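Correlating the sources mostly comes down to putting every event on one clock and sorting. A minimal sketch, assuming events have already been extracted by hand into tuples and that each source is known to be either UTC or server local time (cluster logs are often generated in UTC while the SQL error log uses local time, so verify this for your own files); `build_timeline` and the tuple layout are hypothetical:

```python
from datetime import datetime, timedelta, timezone


def build_timeline(events, local_utc_offset_hours):
    """Merge events from several logs into one UTC-ordered timeline.

    events: iterable of (source, naive_timestamp, is_utc, message) tuples.
    Returns a list of (utc_timestamp, source, message), oldest first.
    """
    local = timezone(timedelta(hours=local_utc_offset_hours))
    normalized = []
    for source, ts, is_utc, message in events:
        tz = timezone.utc if is_utc else local
        utc_ts = ts.replace(tzinfo=tz).astimezone(timezone.utc)
        normalized.append((utc_ts, source, message))
    return sorted(normalized)
```

Usage, with made-up events on a server five hours behind UTC:

```python
events = [
    ("errorlog", datetime(2015, 10, 3, 9, 0, 5), False, "example: lease expired"),
    ("cluster", datetime(2015, 10, 3, 14, 0, 0), True, "example: resource failed"),
]
for when, source, message in build_timeline(events, -5):
    print(when.isoformat(), source, message)
```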

Cluster Log Anatomy

Demos

References
- Appendix A: Details of How Quorum Works in a Failover Cluster
  http://technet.microsoft.com/en-us/library/cc730649(v=ws.10).aspx
- Force Quorum in a Single-Site or Multi-Site Failover Cluster
  http://technet.microsoft.com/en-us/library/dd197500(v=WS.10).aspx
- Tuning Failover Cluster Network Thresholds
  http://blogs.msdn.com/b/clustering/archive/2012/11/21/10370765.aspx
- Configure Heartbeat and DNS Settings in a Multi-Site Failover Cluster
  http://technet.microsoft.com/en-us/library/dd197562(v=WS.10).aspx

References
- LooksAlive and IsAlive Implementation of Availability Groups failure_condition_level
  http://blogs.msdn.com/b/alwaysonpro/archive/2013/09/12/looksalive-and-isalive-implementation-of-availability-groups.aspx
- Configure the Flexible Failover Policy to Control Conditions for Automatic Failover (AlwaysOn Availability Groups)
  http://msdn.microsoft.com/en-us/library/hh710040(v=sql.120).aspx
- How It Works: SQL Server AlwaysOn Lease Timeout
  http://blogs.msdn.com/b/psssql/archive/2012/09/07/how-it-works-sql-server-alwayson-lease-timeout.aspx
- Enhance AlwaysOn Failover Policy to Test SQL Server Responsiveness
  http://blogs.msdn.com/b/alwaysonpro/archive/2014/10/13/enhance-alwayson-failover-policy-to-check-for-connection-and-availability-database-health.aspx

Thank you! Questions?